ARTICLE
Communicated by David Zipser
Conversion of Temporal Correlations Between Stimuli to Spatial Correlations Between Attractors

M. Griniasty
M. V. Tsodyks
Racah Institute of Physics and Center for Neural Computation, Hebrew University, Jerusalem
Daniel J. Amit
INFN, Sezione di Roma, Istituto di Fisica, Università di Roma, La Sapienza, P.le Aldo Moro, Roma, Italy
It is shown that a simple modification of synaptic structures (of the Hopfield type) constructed to produce autoassociative attractors, produces neural networks whose attractors are correlated with several (learned) patterns used in the construction of the matrix. The modification stores in the matrix a fixed sequence of uncorrelated patterns. The network then has correlated attractors, provoked by the uncorrelated stimuli. Thus, the network converts the temporal order (or temporal correlation) expressed by the sequence of patterns into spatial correlations, expressed in the distributions of neural activities in attractors. The model captures phenomena observed in single electrode recordings in performing monkeys by Miyashita et al. The correspondence is close enough to reproduce the fact that, given uncorrelated patterns as sequentially learned stimuli, the attractors produced are significantly correlated up to a separation of 5 (five) in the sequence. This number 5 is universal in a range of parameters, and requires essentially no tuning. We then discuss learning scenarios that could lead to this synaptic structure, as well as experimental predictions following from it. Finally, we speculate on the cognitive utility of such an arrangement.

1 Introduction
*On leave of absence from The Institute of Higher Nervous Activity, Moscow.
†On leave of absence from Racah Institute of Physics, Hebrew University, Jerusalem.

Neural Computation 5, 1-17 (1993)
© 1993 Massachusetts Institute of Technology

1.1 Temporal to Spatial Correlations in Monkey Cortex. The remarkable sequence of neurocognitive experiments by Miyashita (1988), Miyashita and Chang (1988), and Sakai and Miyashita (1991) is the most direct evidence of the relevance of attractor dynamics in cortical cognitive processing. It is at the same time detailed and structured enough
to guide and confront attractor neural network (ANN) modeling. In the first experiment (Miyashita and Chang 1988), the monkey is trained to recognize and match a set of visual patterns. As a result, one observes selective enhancement of neural spike activity, which persists for 16 sec after the removal of the stimulus. The fact that selective, stimulus-related enhancement of neural activity persists for 16 sec in the absence of the provoking stimulus is evidence of nonergodic attractor dynamics (see, e.g., Amit 1992). The same encouraging evidence has forced a confrontation on the question of activity rates in retrieval by attractors: the rates in the Miyashita attractors were many times lower than what models of the Hopfield type (Hopfield 1982; Amit 1989) predicted. This fruitful confrontation led to a study (Amit and Tsodyks 1991) showing that when the neural description, as well as the conditions prevailing in cortex, is taken in greater detail, attractors can appear having stochastic behavior and low rates.

The second study (Miyashita 1988) went further, to provide information about coding in the particular module of the anterior ventral temporal cortex of the monkey: it was discovered that despite extreme precaution in producing visual stimuli uncorrelated in their spatial form, spatial correlations appeared in the patterns of sustained activities evoked by the stimuli during the delay period. These persistent activities we interpret as the structure of the attractors. There was one kind of correlation that was preserved in the stimuli: the temporal order of their presentation was maintained fixed during training. What the monkey's brain appears to be doing is to convert the temporal correlation into a spatial one. Namely, spatial correlations were observed among the attractors corresponding to the stimuli that were close temporally in the training session. These attractors are the result of retrieval dynamics.
The spatial correlations between the activities of the neurons investigated persisted to a fifth neighbor in the temporal sequence. The correlation figure of Miyashita (1988) is reproduced in Figure 1.

1.2 Modeling Correlation Conversion. The main result of the Hopfield program has been to connect the intuitive call for selective (stimulus-dependent) attractor dynamics (associative memory) with specific constructions of synaptic matrices, and therefore to build a bridge to unsupervised learning. The program was limited by the requirement that the attractors be as close as possible to the patterns from which the matrices were constructed, that is, the presumed items in the learning process. This went under the name of autoassociation. Here we shall show that a simple modification of the synaptic matrices used for autoassociation in ANNs leads to a relaxation dynamics that associates, with stimuli near one of the random, uncorrelated underlying patterns, an attractor that is correlated with several patterns. The patterns that have the largest correlations with a given attractor are the neighbors, in the sequence of stored patterns, of the stimulus leading to the attractor.
Figure 1: Spatial correlations between attractors, in the monkey's anterior ventral temporal cortex, corresponding to structurally uncorrelated patterns, as a function of the difference in the position of the learned stimuli in the fixed training sequence. From Miyashita (1988).
It then follows that attractors are correlated among themselves. Again, the attractors that are correlated are the neighbors in the sequence of the underlying patterns. These are just the type of correlations observed by Miyashita. In fact, the number of attractors that are found to be correlated significantly in the model is the same as in the experiment. The extended model is discussed in two different variants: one is the original formulation of ±1 neurons, with the artificial symmetry between active and passive states of neurons; the second is a 0-1 formulation (Tsodyks and Feigel'man 1988; Buhmann et al. 1989), in which this symmetry is removed and which can naturally be interpreted in terms of high and low activity rates of neurons in attractors. The results differ in detail, but the main qualitative feature, of converting sequential order among uncorrelated patterns to a set of correlated attractors, is present in both. Both models have symmetric synaptic matrices, which are unrealistic but convenient. The study of autoassociative ANNs over the last several years has made it clear that most of the attractor properties of these extensively connected networks are rather robust to the introduction of synaptic asymmetry (see, e.g., Amit 1989). We then proceed to interpret the proposed synaptic matrices in terms of learning dynamics. It is argued that rather plausible synaptic dynamics, accompanying the relaxation in the ANN, may produce a synaptic
matrix with correlated attractors for uncorrelated external stimuli. Within such learning scenarios, one is led to predict that the presentation of uncorrelated patterns in a random sequence would produce attractors that are uncorrelated, and are each close to the representation of the original patterns, as would be the case in the Hopfield model. Finally, we discuss the potential utility of such conversions of temporal correlations to spatial correlations in modeling several aspects of cognitive behavior.

2 The Model with ±1 Neurons
The original way of pursuing the Hopfield ANN program was to choose the variables describing the instantaneous state of each neuron as S_i(t) = ±1, where i labels the neuron (i = 1, ..., N). The patterns, to be stored in an N-neuron network, are N-bit words of ±1s, the value of each bit chosen independently with probability 0.5. Denoting by ξ_i^μ the activity of neuron number i in pattern number μ, the proposed synaptic matrix is written as

J_ij = (1/N) Σ_μ [ ξ_i^μ ξ_j^μ + a ( ξ_i^μ ξ_j^{μ+1} + ξ_i^{μ+1} ξ_j^μ ) ]    (2.1)

where p is the total number of patterns stored in the connections. The patterns, μ = 1, ..., p, are considered to form an ordered sequence, which corresponds to the order of presentation in the training phase. For simplicity, the sequence is taken to be cyclic. Each pattern in the construction of the matrix is connected to one preceding pattern. Note, in particular, that this extended matrix still preserves the symmetry of the Hopfield matrix, which implies that all attractors will be fixed points, and makes the analysis much simpler. How this relates to a learning scenario is discussed in Section 4. The matrix of equation 2.1, for a = 0, reduces to the original Hopfield matrix. This matrix is accompanied, as usual, by a schematic spike emission dynamics that, in the noiseless case, determines the new state of the neuron according to

S_i(t + δt) = sign[h_i(t + δt)]    (2.2)

where

h_i(t) = Σ_j J_ij S_j(t)    (2.3)

h_i mimics the value of the postsynaptic potential, relative to the threshold, on neuron i. The linear superposition of bilinear terms in the neural activities of the stored patterns is sometimes referred to as Hebbian learning from a "tabula rasa." We shall return to the question of learning later.
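The model just defined can be simulated directly. The sketch below (not the authors' code; the network size N is illustrative, while p = 13 and a = 0.7 are the values quoted later for Figure 2) builds the matrix of equation 2.1 for a finite network, runs the noiseless dynamics of equation 2.2 asynchronously from a pure pattern, and prints the overlaps of the resulting fixed point with all stored patterns:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, a = 1500, 13, 0.7           # N illustrative; p, a as in Figure 2

# Random +-1 patterns, one per row; the sequence is taken to be cyclic.
xi = rng.choice([-1, 1], size=(p, N)).astype(float)

# Synaptic matrix of equation 2.1: autoassociative term plus the two
# symmetric transition terms linking each pattern to its successor.
xi_next = np.roll(xi, -1, axis=0)
J = (xi.T @ xi + a * (xi.T @ xi_next + xi_next.T @ xi)) / N
np.fill_diagonal(J, 0.0)          # drop self-couplings so updates settle

# Noiseless asynchronous dynamics: update neurons one at a time until
# no spin flips (guaranteed to terminate for a symmetric matrix).
S = xi[0].copy()                  # stimulus = pure pattern number 0
changed = True
while changed:
    changed = False
    for i in range(N):
        s_new = 1.0 if J[i] @ S >= 0 else -1.0
        if s_new != S[i]:
            S[i] = s_new
            changed = True

m = xi @ S / N                    # overlaps of the attractor, equation 2.4
print(np.round(m, 2))
```

With a < 0.5 the same script retrieves the pure pattern (a single overlap close to 1); for 0.5 < a < 1 the fixed point acquires several sizable overlaps centered on the stimulated pattern, as described below.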
The natural variables for the description of the nonergodic asymptotic behavior of the network are the "overlaps" m^μ(t) of the current state of the network, S_i(t), with the stored pattern μ. They are defined as

m^μ(t) = (1/N) Σ_i ξ_i^μ S_i(t)    (2.4)
See, for example, Amit et al. (1985) and Amit (1989). The value of the overlap m^μ measures how close the state of the network is to the stored pattern μ. If m^μ = 1, the state is identical, as a binary word, to the pattern μ, that is, S_i = ξ_i^μ for all i. If m^μ = 1 for the asymptotic state, the attractor, then the corresponding pattern is retrieved perfectly. With the matrix, equation 2.1, the field h_i can be expressed in terms of the overlaps (Amit et al. 1985), which implies that so can the dynamics of the network, as well as its attractors. Namely, we can write equation 2.3 as

h_i(t) = Σ_μ ξ_i^μ [ m^μ(t) + a ( m^{μ+1}(t) + m^{μ-1}(t) ) ]    (2.5)
from which one derives the mean-field equations determining the attractors. For a symmetric matrix, those are simple fixed points. They read, in the limit of a network of a large number of neurons with a relatively low number of stored patterns:

m^μ = ⟨⟨ ξ^μ sign( Σ_ν ξ^ν [ m^ν + a ( m^{ν+1} + m^{ν-1} ) ] ) ⟩⟩    (2.6)
The double angular brackets imply an averaging over the distribution of the bits in the patterns (see, e.g., Amit et al. 1985; Amit 1989). Autoassociation was the interpretation of the fact that, in the absence of noise, for low loading, the equations 2.6 had solutions with one single m^μ ≠ 0, which attracted a wide set of initial states in the neighborhood of each pattern. Away from these large basins, "spurious states" were found to exist (Amit et al. 1985). Moreover, the artificial symmetry of the +1 and -1 states produced attractors of the sign-reversed states of each pattern. These retrieval properties of the Hopfield ANN have been found very robust to extensive noise and synaptic disruption, including asymmetric disruption. If one tries a pure pattern solution for equations 2.6, with μ = 2 for example, one has¹:
m^2 = ⟨⟨ ξ^2 sign( m^2 [ ξ^2 + a ( ξ^1 + ξ^3 ) ] ) ⟩⟩    (2.7)

¹The 2s are superscripts, not squares.

For a < 0.5, it is the first term in the square brackets that dominates the sign of the argument of the sign-function, and m^2 = 1 is a fixed point solution, as in the case a = 0. For a > 0.5, this is no longer the case. For 25% of the sites ξ^3 = ξ^1 = -ξ^2, and the argument has the sign opposite to that of ξ^2. Starting from a state with m^2 = 1, and all other overlaps 0,
one arrives, after one step, at a state with m^2 = m^1 = m^3 = 0.5. This is no fixed point either. The solutions of equations 2.6 have several overlaps different from zero. The previous discussion suggests a numerical procedure for arriving at the solution: start from a pure pattern state and iterate until convergence. This is what the network would do, if given one of the pure patterns it learned, S_i = ξ_i^μ, as an initial state, until it relaxes to a fixed point. The symmetry of the dynamics under pattern permutations implies that this has to be done for one pattern only. The equations were solved in this way. One finds that starting from a pure pattern, one arrives at a stable solution after several iterations. The solution reached is a state with nonzero overlaps with several stored patterns, symmetrically distributed around the pattern which served as the stimulus. Only a small number, actually 5, of these overlaps are significantly large, provided a < 1.² This distribution of overlaps in an attractor, corresponding to one of the underlying patterns, is shown in Figure 2. In this case p = 13 patterns are stored, and a = 0.7. It is remarkable that the structure of the attractor depends neither on the number of patterns p, nor on the value of a, in the entire range 0.5 < a < 1. For a > 1, the network develops attractors that have overlaps with all stored patterns. The values of the overlaps decrease as the number of patterns increases. This means that after learning sufficiently many patterns, the network loses its ability to associate attractors with the stimuli. One can read from Figure 2 that the retrieval attractor has substantial overlaps with several patterns, symmetrically disposed, before and after, in the sequence relative to the pattern corresponding to the stimulus. Clearly, if each attractor is correlated with several patterns, then the attractors corresponding to different patterns must themselves be correlated.
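The iteration just described is easy to reproduce. In the sketch below (a reconstruction, not the authors' code), the average over the pattern distribution in equations 2.6 is performed exactly by enumerating all 2^p assignments of the bits (ξ^1, ..., ξ^p), with p = 13 and a = 0.7 as in Figure 2; starting from a pure pattern state, the iteration settles on a symmetric profile of overlaps:

```python
import numpy as np

p, a = 13, 0.7                    # values quoted for Figure 2

# All 2**p assignments of the bits, as rows of +-1: the double angular
# brackets of equations 2.6 become an exact mean over these rows.
bits = ((np.arange(2 ** p)[:, None] >> np.arange(p)) & 1) * 2 - 1

m = np.zeros(p)
m[0] = 1.0                        # start from a pure pattern state
for _ in range(200):
    # coefficient of xi^nu in the field (equation 2.5), cyclic in nu
    c = m + a * (np.roll(m, 1) + np.roll(m, -1))
    s = np.sign(bits @ c)
    s[s == 0] = 1                 # break exact ties upward
    m_new = (bits * s[:, None]).mean(axis=0)
    if np.allclose(m_new, m, atol=1e-12):
        break
    m = m_new

print(np.round(m, 3))             # a few symmetric nonzero overlaps
```

The first iteration reproduces the step quoted in the text (the overlap 0.5 spreading to the two neighbors), and the converged profile has a small number of large overlaps centered on the stimulated pattern.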
These correlations would correspond to the correlations measured by Miyashita and Chang (1988) (Fig. 1). The correlation of activities in two attractors, σ^μ and σ^ν, is defined as

C(μ, ν) = (1/|C|) Σ_{i=1}^{N} ( σ_i^μ - σ̄ )( σ_i^ν - σ̄ )    (2.8)

where σ̄ is the average activity in a given attractor and the normalization constant |C| is chosen so that C(μ, μ) = 1. In the present case, σ̄ = 0 and |C| = N. Hence, the correlation of attractors μ and ν can be written as

C(μ, ν) = (1/N) Σ_i sign(h_i^μ) sign(h_i^ν) = ⟨⟨ sign(h^μ) sign(h^ν) ⟩⟩    (2.9)
²Note added in proof: L. Cugliandolo has recently proved that beyond 5 all overlaps are exactly zero.
Figure 2: Overlaps between an attractor and stored patterns, as a function of the separation of the pattern in the sequence from the pattern underlying the attractor.

where h_i^μ is the local field on neuron i when the network is in attractor number μ. The last equality is an expression of self-averaging, giving an average over the distribution of patterns. Finally, substituting the fields from equation 2.5 in equation 2.9, we arrive at the correlation coefficient:
C(μ, ν) = ⟨⟨ sign( Σ_ρ ξ^ρ [ m_μ^ρ + a ( m_μ^{ρ+1} + m_μ^{ρ-1} ) ] ) sign( Σ_ρ ξ^ρ [ m_ν^ρ + a ( m_ν^{ρ+1} + m_ν^{ρ-1} ) ] ) ⟩⟩    (2.10)
where m_μ^ρ is the overlap of the attractor corresponding to stimulus number μ with pattern number ρ. These attractor correlations are illustrated in Figure 3, where we plot the correlations among different pairs of attractors vs. the distance between their corresponding patterns. Figure 3 clearly demonstrates that while the stored patterns are completely random, and hence uncorrelated, the states reached by the network on presentation of these same patterns have a substantial degree of correlation, which decreases with the separation of the patterns in the sequence. Note that while an attractor "sees" two to three patterns on each side, it sees five attractors on each side.

Figure 3: Correlations between attractors as a function of the separation in the sequence of the patterns to which the attractors belong.

The qualitative form of the correlations captures the experimental trend (Fig. 1). The absolute values of the correlations differ. This can be due to a different normalization used in Miyashita and Chang (1988), where the normalization is not given explicitly. It may also be that the absolute values will be different in more realistic networks. Finally, the analysis given above is based on the exact symmetry of the matrix 2.1, in which case the system has only fixed point attractors. Asymmetry can enter in two ways: either as a local, random disruption of the synaptic elements, or as a coherent asymmetry of the two transition terms in the symmetric matrix. The first type of asymmetry is in the realm of the robustness of the attractor dynamics of the network. Concerning the second type: the mean-field equations 2.6 hold even if the coefficients of the two transition terms in J_ij are not equal. We have found that the behavior of the network is robust against some amount of asymmetry between the two coefficients. However, if the asymmetry becomes too large, equations 2.6 first acquire another solution, with the maximum overlap shifted to another pattern. At still higher asymmetry, the fixed point solution is lost, in favor of a time-dependent attractor. In this attractor the network moves from one pattern to another, in a direction determined by the major nondiagonal term in the matrix 2.1.
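The correlations of equation 2.9 can be evaluated in the same exact-enumeration style. The sketch below (again a reconstruction, with p = 13 and a = 0.7 as above) first finds the overlap profile of one attractor by iterating equations 2.6, then uses the fact that the attractor belonging to pattern μ carries the same profile shifted by μ along the cyclic sequence, so that the correlation depends only on the separation:

```python
import numpy as np

p, a = 13, 0.7
bits = ((np.arange(2 ** p)[:, None] >> np.arange(p)) & 1) * 2 - 1

def field_sign(m):
    # sign of the field of equation 2.5 for every bit assignment
    c = m + a * (np.roll(m, 1) + np.roll(m, -1))
    s = np.sign(bits @ c)
    s[s == 0] = 1
    return s

# Overlap profile of the attractor reached from pattern 0 (equations 2.6).
m = np.zeros(p)
m[0] = 1.0
for _ in range(200):
    m_new = (bits * field_sign(m)[:, None]).mean(axis=0)
    if np.allclose(m_new, m, atol=1e-12):
        break
    m = m_new

# Attractor mu has the same profile shifted by mu, so the correlation of
# equation 2.9 depends only on the separation d = mu - nu.
s0 = field_sign(m)
corr = [float((s0 * field_sign(np.roll(m, d))).mean()) for d in range(p)]
print(np.round(corr, 2))
```

The printed values decrease with the separation and are essentially zero once the two overlap profiles no longer share any condensed pattern, mirroring the trend of Figure 3.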
3 ANN with Discrete 0-1 Neurons
The above description was extended to deal with a 0-1 representation of information, allowing for the removal of the symmetry between active and refractory states of the neurons (Tsodyks and Feigel'man 1988; Buhmann et al. 1989). This description has several further advantages: its terms are very close to a representation in terms of spike rates, which are positive analog variables; moreover, it allows for very efficient storage of patterns as the coding rate, that is, the fraction of 1s in the patterns, becomes very low (sparse coding). Since it is important to show that the correlation effects to be discussed take place also for this case, we shall recall that formulation as well. The dynamics is described in terms of instantaneous neural variables, V_i(t), which take the values {0, 1} as

V_i(t + δt) = Θ( Σ_j J_ij V_j(t) - θ )    (3.1)
where Θ(x) = 1 for x > 0, and 0 otherwise, and θ is a neural threshold. These variables can be directly interpreted as high and low activity (spike rates) of each neuron. Such would be a description in terms of analog rates, in a high-gain limit. In this case the patterns to be stored by a learning dynamics, η^μ, are chosen as N-bit words of independently chosen 0s and 1s, that is,

η_i^μ = 0, 1;    μ = 1, ..., p    (3.2)

where the probability for a 1-bit (0-bit) is f (1 - f), respectively. An extension of the symmetric synaptic matrix, appropriate for autoassociation (Tsodyks and Feigel'man 1988; Buhmann et al. 1989), to our requirements would be

J_ij = (1/(N f(1 - f))) Σ_μ [ (η_i^μ - f)(η_j^μ - f) + a ( (η_i^μ - f)(η_j^{μ+1} - f) + (η_i^{μ+1} - f)(η_j^μ - f) ) ]    (3.3)

and the corresponding overlaps are generalized to

m^μ(t) = (1/(N f(1 - f))) Σ_i ( η_i^μ - f ) V_i(t)    (3.4)
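As a quick sanity check of the normalization assumed here for the generalized overlap (the prefactor 1/(N f(1 − f)) is our reconstruction), one can verify numerically that a network state equal to one stored sparse pattern has overlap close to 1 with that pattern and close to 0 with all the others. The values of N, p, and f below are chosen only to keep the finite-size fluctuations small; the text's own example uses f = 0.01.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, f = 20000, 11, 0.05         # sizes chosen for a quick check

# Sparse 0-1 patterns: each bit is 1 with probability f.
eta = (rng.random((p, N)) < f).astype(float)

def overlaps(V):
    # m^mu of equation 3.4: (1 / (N f (1 - f))) * sum_i (eta_i^mu - f) V_i
    return (eta - f) @ V / (N * f * (1 - f))

m = overlaps(eta[3])              # network state = stored pattern 3
print(np.round(m, 2))             # ~1 for pattern 3, ~0 for the others
```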
With the couplings 3.3, the dynamics can be expressed in terms of the above overlaps, and so can the fixed points of the retrieval attractors. The latter have the form

m^μ = (1/(f(1 - f))) ⟨⟨ ( η^μ - f ) Θ( Σ_ν ( η^ν - f ) [ m^ν + a ( m^{ν+1} + m^{ν-1} ) ] - θ ) ⟩⟩    (3.5)
When a = 0, the system of equations reduces to that of Tsodyks and Feigel'man (1988) and Buhmann et al. (1989). In this case, at low loading, the exact stored patterns are the retrieval attractors of the network, that is, the equations admit solutions with a single nonvanishing overlap, which in turn is equal to 1. These attractors persist until a reaches the critical value

a_c = (θ + f) / (2(1 - f))    (3.6)
Above this value of a, the pure patterns are unstable and the network, having a symmetric synaptic matrix, finds new fixed points. The equations 3.5 have to be solved numerically for the values of the overlaps in the retrieval attractors. This we do, again following the network, as was explained in the previous section. A typical solution is shown in Figure 4, for parameter values f = 0.01, θ = 0.2, p = 11, and a = 0.25. Figure 4a represents the overlaps vs. the pattern number in the sequence, relative to the pattern of the stimulus. Figure 4b is the correlation between the attractors. Note that in distinction to the ±1 case, the significant overlaps here, of which there are five in total, are all equal. They are all unity, up to terms of O(f). This implies that the attractor is approximately the union of the 1-bits in the five patterns centered around the stimulus. In particular, the mean spatial activity level in the attractors is higher than in the pure patterns, a fact that can be tested experimentally.

The correlation Figure 4b may seem somewhat simple compared with the experimental one of Figure 1. Clearly, the experimental correlations are not a straight line going to zero at a separation of five patterns. We find the appearance of the correlations, as well as their clear trend to decrease with the separation in the training sequence, down to very small values at a separation of five, very significant. All that was put in was the synaptic structure connecting successive patterns in the sequence. The remaining differences may be attributed to several factors, all of which are under study. These factors are:

- The neurons in the experiment are analog neurons, represented by spike rates, and not discrete 0-1 neurons.
- In the experiment the neurons operate in the presence of noise, while here for simplicity we dealt with a noiseless situation.
- The matrix we chose is surely not the matrix in the monkey's cortex. One consequence is that all our attractors are identical.
- In the experiment the sample groups of neurons are small and are chosen in special ways. This leads, inter alia, to large fluctuations. Our correlations are ideal in that they take into account an infinite number of neurons.
All these effects can be studied either by an extension of the above reasoning to analog neurons with noise, or by simulations. These studies are under way. The attractor-to-attractor correlations are computed according to equation 2.8. What remains is to determine σ̄ and |C| for this case. If the mean proportion of 1-bits in the attractors is g, then σ̄ = g and |C| = Ng(1 - g). It is rewarding to see that the correlations between neighboring attractors are monotonically decreasing with the separation in the sequence, and are disappearing after the fifth neighbor, as in the experimental data. In the present case, the number of condensed patterns, those having large values of the overlap with the attractor corresponding to a stimulus, depends on the value of a. This variation leaves finite intervals of a in which the attractors are invariant. Increasing a, we observe a sequence of bifurcations, where the number of condensed patterns increases by two. Correspondingly, the number of significantly correlated attractors increases by four on crossing a bifurcation value of a. Between any two bifurcation points the solution does not change, that is, the number of significantly correlated attractors, as well as the magnitude of the correlations, remains invariant.
4 Learning
In this section we discuss possible learning scenarios that could lead to a synaptic structure of the type considered in the previous sections. At present there is not enough information about the learning mechanism and memory preservation in the cortex, and our discussion can at best be tentative. We feel, though, that such a discussion may not be completely premature, precisely because of the level of specific detail provided by the experiments of Miyashita et al., and the ability of theory to approach a similar level of detail. Moreover, it is our feeling that a discussion of the implications of such findings for learning may lead to experiments that shed additional light on constraints on learning through neurophysiological correlates of behavior. It is plausible to describe the synaptic dynamics as
dJ_ij/dt = -γ J_ij + K_i L_j    (4.1)

where γ is the rate of decay of the synaptic value and K_i, L_j are, respectively, the post- and presynaptic contributions to the synaptic efficacy. Both K and L depend on the activity of the corresponding neuron. A simple mechanism that would lead to the matrix 2.1 could be to apply a usual Hebbian modification rule, with both pre- and postsynaptic
Figure 4: Correlations between attractors (a) and overlaps of attractor with patterns (b) vs. the separation in the sequence of the patterns to which the attractors belong.
terms as linear combinations of the current and preceding patterns, for example,

K_i(t) = K_1(t) ξ_i^{μ+1} + K_2(t) ξ_i^μ,    L_j(t) = L_1(t) ξ_j^{μ+1} + L_2(t) ξ_j^μ    (4.2)

where μ+1 labels the pattern being presented and μ the preceding one.
This form may result from two different scenarios. In both we assume that strong presentations of the individual, uncorrelated patterns create attractors for those patterns themselves. Then, during training, which consists of many repeated presentations, the network, which remains in an attractor between presentations, is made to move to the next attractor by a new presentation. It should be emphasized that in this description the role of the attractors is quite crucial. Before the patterns themselves are stored in the synapses as attractors, at the presentation of a consecutive pattern (to be learned) in the sequence during training, there is no memory of the previous pattern. This is especially true if the time between presentations of consecutive patterns is as long as in the experiments of Miyashita et al.

The difference between the two scenarios is in the way we view the origin of the source terms, K and L, for the synaptic change. In the first, we assume that the values of the neuronal spike frequencies represent, in an analog way, the transition between the two attractors. In this picture, K_1(t) = L_1(t) = 0 before the transition starts, which is about when the next pattern is presented. When the network is well established in the new attractor, K_2(t) and L_2(t) tend to zero. In the second scenario, one assumes that it is the local synaptic variable that remembers some short history of the pre- or postsynaptic activity of the corresponding neuron. For example, it may be the case that the synaptic mechanism modifies its effectiveness depending on the mean of the neuron's activity in some prior time window τ. The pre- or postsynaptic change may be enhanced, or suppressed, by a history of high average mean activity.
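The second scenario can be sketched numerically. In the toy run below (all parameter values hypothetical, and the exponential moving average only one possible realization of the "moving mean" idea), each neuron carries a trace of its recent activity, the synapse integrates the Hebbian product of the two traces while a cyclic sequence of ±1 patterns is presented, and the accumulated matrix is projected on the autoassociative and transition terms of the sequence. The effective transition coefficient comes out positive and below 0.5:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 2000, 6                      # hypothetical sizes
epochs, T_show, alpha = 2, 10, 0.2  # steps per pattern; trace update rate

xi = rng.choice([-1, 1], size=(p, N)).astype(float)

# Each neuron keeps a moving average r_i of its recent activity;
# the synapse integrates the Hebbian product r_i r_j over the session.
J = np.zeros((N, N))
r = xi[0].copy()
for _ in range(epochs):
    for mu in range(p):
        for _ in range(T_show):
            r = (1 - alpha) * r + alpha * xi[mu]  # trace mixes old and new
            J += np.outer(r, r)

# Project J on the autoassociative and transition terms of the sequence.
auto = np.mean([xi[mu] @ J @ xi[mu] for mu in range(p)]) / N**2
trans = np.mean([xi[mu] @ J @ xi[(mu + 1) % p] for mu in range(p)]) / N**2
a_eff = trans / auto
print(round(a_eff, 3))              # effective transition coefficient
```

Because the crossover window is short compared with the presentation time, a_eff stays well below 0.5; lengthening the mixing, or letting older autoassociative contributions decay, moves it up, in line with the remarks on synaptic decay that follow.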
Since the moving mean of the activity history is a linear combination of the activity in two consecutive patterns while the network is moving from one attractor to the next, the end result is the same, provided, of course, that the averaging window is short in comparison with the time spent in each attractor. But this does not seem to be a strong requirement, given that the network stays in these attractors for many seconds. As a simple assumption we can take K = L, as functions of the pre- and postsynaptic activities, which implies K_1 = L_1, K_2 = L_2. Relaxing this constraint would lead us to asymmetric synaptic transition terms, of the type discussed in Section 2. If the resulting asymmetry is not large, we expect the performance of the network to be robust. In the symmetric case, the contribution to the synaptic dynamics is

K_i L_j = K_1² ξ_i^{μ+1} ξ_j^{μ+1} + K_2² ξ_i^μ ξ_j^μ + K_1 K_2 ( ξ_i^{μ+1} ξ_j^μ + ξ_i^μ ξ_j^{μ+1} )    (4.3)
As one pattern follows the other, these contributions sum up when equation 4.1 is integrated. If we neglect the exponential decay, γ, the summation is direct, and after a long time, when all patterns have been presented many times in a fixed order, the resulting matrix would be proportional to

J_ij ∝ Σ_μ [ ξ_i^μ ξ_j^μ + a ( ξ_i^μ ξ_j^{μ+1} + ξ_i^{μ+1} ξ_j^μ ) ],    a = ∫ K_1 K_2 dt / ∫ ( K_1² + K_2² ) dt    (4.4)
where the time integration is over an interval τ in which synaptic modification is taking place. This matrix has the same form as the one we introduced in the previous sections. It corresponds, though, to a case in which a ≤ 0.5. This fact should not be considered too adverse. Synaptic decay, for example, is sufficient to raise a above 0.5. In the final analysis one should consider analog neurons, toward which the 0-1 neurons are an intermediate stage. Even for the discrete 0-1 neurons, the critical value of a is much lower than 0.5, while the heuristic learning mechanism can remain essentially the same. Finally, if the patterns are presented in a random order during training, one can expect every pattern to be followed by any other one, given that a large number of presentations is required for satisfactory learning. This implies that the transition terms in equation 4.4, containing any particular pattern, will be multiplied by a sum over all other patterns. That sum vanishes on the average, and the transition terms become negligibly small. No correlations are then generated by the network from uncorrelated patterns.

5 Experimental Predictions and Some Speculations
Given that a synaptic matrix, which can be learned without supervision, is able to convert temporal correlations into spatial ones, one is tempted to make some preliminary speculations about the computational and behavioral utility of such synaptic development. One directly measurable application was pointed out in Sakai and Miyashita (1991). In this experiment the monkeys are trained to recognize 24 visual patterns, organized in 12 fixed pairs. The pairs are presented in a random order. Correlations are generated between the two members of each pair only. Those correlations are then shown to be correlated with the ability of the monkey to retrieve the second member of a pair, after being presented with the first. The basic relevance of this type of association for the construction of cognitive behavioral patterns is quite immediate. What is special about this particular experiment is that the associative retrieval of the paired member is directly connected to the presence of the correlations in the representation of the pairs of attractors in the part of cortex under electrophysiological observation.
The interpretation of this experiment does not require speculation. To go one step beyond, one can expect the generation of such correlations to underlie the effect of priming (Besner and Humphreys 1990). In other words, if the network is in one of its attractors, and a new stimulus is presented, the transition between two attractors that are highly correlated (i.e., have a particularly large number of active neurons common to their representations) is much faster than the transition between less correlated attractors. This effect was observed in a simulation with realistic neurons (Amit et al. 1991), when the pure patterns involved in the construction of the synaptic matrix included explicit correlations (Rubin 1991). This effect can be directly measured in a Miyashita (1988) type experiment. One would expect that the transition time between different attractors would increase with the distance of the two patterns in the sequence of presentation.

In cognitive psychology the effect is familiar in experiments in which the reaction time is measured for the recognition of distorted words or other images. This reaction time is significantly shortened if the pattern to be recognized is preceded by a cognitively correlated pattern (Besner and Humphreys 1990). In the language of the model we would say that the "priming" image leads the network into its corresponding attractor. That attractor is correlated with the attractor corresponding to the test stimulus. Hence, the transition between the two is faster than the transition from some other state in which the network may otherwise find itself. Complementing this scenario with the suggestion that at least part of our basic cognitive correlations is related to temporal contiguity of afferent stimuli completes this speculation.

This interpretation can be extended one small step further. As attractors get increasingly correlated, there is an increase in the probability that noise would cause transitions between them, transitions of the Buhmann-Schulten type (Buhmann and Schulten 1987).
This opens the way for a scenario in which such transitions can be provoked in a cortical network by random afferent activation of the module. The transitions will tend to take place between correlated attractors, which in the present model are related to temporal proximity during learning. Note that this process can also be observed in experiments of the Miyashita type, though their cognitive content is more difficult to investigate. One could hope to be able to investigate the process of learning the matrix that generates the correlations. We have argued in Section 4 that the process will go through the intermediate stage of learning the pure pattern attractors first. This was based on the assumption that there is autonomous learning in the particular module under observation. This is not self-evident, and it may be that the pure patterns are quickly learned as attractors in a different area, such as the hippocampus, and those attractors then assist in learning the correlated attractors. Since the question is open, one could attempt to clarify it by presenting different parts of the training sequence, in an experiment such as Miyashita (1988), with
different frequencies. Then, if learning actually first goes through the creation of individual attractors for the pure patterns, one should observe lower correlations in the parts shown less frequently, as well as lower coding rates. In other words, pure patterns are expected to use fewer neurons than the composite patterns correlated by the dynamics (see, e.g., Section 3). On the other hand, if the module learns the correlated attractors directly, no group of patterns should show the appearance of uncorrelated attractors.

Acknowledgments
The authors acknowledge useful discussions with N. Rubin, S. Seung, and H. Sompolinsky. DJA has benefited from many useful discussions of the learning mechanisms with Stefano Fusi. We are indebted to N. Rubin for information concerning priming effects. MVT's research is supported in part by the Ministry of Science and Technology.

References

Amit, D. J. 1989. Modeling Brain Function. Cambridge University Press, New York.
Amit, D. J. 1992. In defence of single electrode recordings. NETWORK 4(4).
Amit, D. J., Gutfreund, H., and Sompolinsky, H. 1985. Spin-glass models of neural networks. Phys. Rev. A 32, 1007.
Amit, D. J., Evans, M. R., and Abeles, M. 1991. Attractor neural networks with biological probe neurons. NETWORK 1, 381.
Amit, D. J., and Tsodyks, M. V. 1991. Quantitative study of attractor neural network retrieving at low spike rates I: Substrate-spikes, rates and neuronal gain. NETWORK 2, 259; and Low-rate retrieval in symmetric networks. NETWORK 2, 275.
Besner, D., and Humphreys, G., eds. 1990. Basic Processes in Reading: Visual Word Recognition. Erlbaum, Hillside, NJ.
Tweney, R. D., Heiman, G. H., and Hoemann, H. W. 1977. Effects of visual disruption on sign intelligibility. J. Exp. Psychol. Gen. 106, 255.
Buhmann, J., Divko, R., and Schulten, K. 1989. Associative memory with high information content. Phys. Rev. A 39, 2689.
Buhmann, J., and Schulten, K. 1987. Noise driven associations in neural networks. Europhys. Lett. 4, 1205.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554.
Miyashita, Y. 1988. Neuronal correlate of visual associative long-term memory in the primate temporal cortex. Nature (London) 335, 817.
Miyashita, Y., and Chang, H. S. 1988. Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature (London) 331, 68.
Rubin, N. 1991. Private communication.
Sakai, K., and Miyashita, Y. 1991. Neural organization for the long-term memory of paired associates. Nature (London) 354, 152.
Tsodyks, M. V., and Feigel'man, M. V. 1988. The enhanced storage capacity in neural networks with low activity level. Europhys. Lett. 6, 101.
Received 21 January 1992; accepted 5 May 1992.
NOTE
Communicated by Hal White
On the Realization of a Kolmogorov Network
Ji-Nan Lin, Rolf Unbehauen
Lehrstuhl für Allgemeine und Theoretische Elektrotechnik, Universität Erlangen-Nürnberg, Cauerstrasse 7, D-8520 Erlangen, Germany
It has been suggested that the theorem by Kolmogorov (1957) about the multivariate function representation in the form

f(x_1, ..., x_N) = Σ_{q=1}^{Q} g_q( Σ_{n=1}^{N} φ_{qn}(x_n) )    (0.1)
with Q ≥ 2N + 1 provides theoretical support for neural networks that implement multivariate mappings (Hecht-Nielsen 1987; Lippmann 1987). Girosi and Poggio (1989) criticized Kolmogorov's theorem as irrelevant. They based their criticism mainly on the fact that the inner functions φ_{qn} are highly nonsmooth and the output functions g_q are not in a parameterized form. However, this criticism was not convincing: Kurkova (1991) argued that highly nonsmooth functions can be regarded as limits or sums of infinite series of smooth functions, and that the problems in realizing a Kolmogorov network can be eliminated by approximately implementing φ_{qn} and g_q with known networks. In this note we present our view on the discussion from a more essential point of view. Since φ_{qn} in equation 0.1 should be universal, Kolmogorov's theorem can be regarded as a proof of a transformation of representation of multivariate functions in terms of the Q univariate output functions g_q. [In some improved versions of Kolmogorov's theorem it is proved that only one g in equation 0.1 is necessary (Lorentz 1966).] Such a strategy is embedded in the network structure as shown in Figure 1. (Note that the block T is independent of f.) If Figure 1 is thought of as a general network structure for approximation of multivariate functions, a question is whether an arbitrarily given multivariate function f can be (approximately) implemented through an (approximate) implementation of the corresponding Q univariate functions g_q. To this question we have an answer as stated below:

Proposition. In Figure 1, an approximate implementation of g_q does not in general deliver an approximate implementation of the original function f, unless g_q can be exactly implemented.

Neural Computation 5, 18-20 (1993)
© 1993 Massachusetts Institute of Technology
Figure 1: The basic strategy in Kolmogorov's theorem is embedded in a network structure where a universal transformation T maps the multidimensional Euclidean space (the domain of multivariate functions) into one or several unidimensional ones.

Here we mean by function approximation a mechanism that provides an estimation of the corresponding output (function value) at an arbitrary point in the input space (domain), which is meaningful in some sense (e.g., an interpolation of some given samples). Such a procedure is closely related to the "spatial relation" (e.g., the Euclidean distance) defined in the input space. For instance, from the viewpoint of interpolation, an estimation of the function value at a point depends on its position relative to that of the sample points in the domain, as well as on the sample values. As we know, there does not exist a homeomorphism between a multidimensional Euclidean space and an interval in the real line; i.e., there does not exist a way to map points from the former into the latter while preserving the spatial relations between them. (It is due to this nonhomeomorphism that the inner functions φ_{qn} must be highly nonsmooth.) That means that approximating g_q on the real line may not have a meaning equivalent to that of approximating the original function f in a multidimensional space. For instance, a "reasonable" interpolation in an interval of the real line between two samples of a g_q may lose its meaning in the corresponding regions of the multidimensional domain of f. In fact, there exist various methods of one-to-one mapping from a multidimensional space to a unidimensional one. (The construction of the inner functions φ_{qn} in Kolmogorov's theorem is only one of them.) Therefore, it is not difficult to understand that the structure in Figure 1
is, theoretically, for an exact representation of multivariate functions. If the uni-input subnetworks in Figure 1 are assumed to be adjustable for each input point, that is, if they are able to provide for each input point an arbitrarily desired output independent of the other points, then g_q, and thus f, can be exactly implemented by the network. However, such an assumption has little significance in practice. On the other hand, from the viewpoint of information theory, the implementation of the univariate functions g_q does not mean a simplification of that of the original multivariate one f, since the universality of the inner structure T implies that all the information describing f must be carried by g_q. Based on the above discussion, we believe that Kolmogorov's theorem is irrelevant to neural networks for mapping approximation. The discussion is illustrated by a network model with an example of function approximation (Lin and Unbehauen 1992). The consequence of our discussion is not encouraging for efforts toward constructing mapping approximation networks along the lines of Kolmogorov's theorem and its proofs. However, if we take equation 0.1 as a general approximation representation of multivariate functions in terms of summation and superposition of univariate functions, it is closely relevant to mapping networks. Some useful neural networks (e.g., the perceptron type network) can be represented by equation 0.1.
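The loss of spatial relations under a one-to-one map to the line is easy to demonstrate numerically. The sketch below uses decimal-digit interleaving, an illustrative map of our own choosing (not Kolmogorov's construction, whose inner functions are continuous): two points that are 2 × 10⁻⁸ apart in the plane land roughly 0.09 apart on the line, so any "reasonable" one-dimensional interpolation between samples would ignore their proximity in the plane.

```python
def interleave(xd: str, yd: str) -> float:
    """Map a point (x, y) in [0,1)^2, given as strings of decimal digits,
    to a point in [0,1) by interleaving the digits.  One-to-one (up to
    the truncation of the digit strings), but NOT a homeomorphism."""
    s = "".join(a + b for a, b in zip(xd, yd))
    return int(s) / 10**len(s)

# Two points that are extremely close in the plane: x = 0.19999999
# versus x = 0.20000001 (distance 2e-8), same y = 0.5
tp = interleave("19999999", "50000000")
tq = interleave("20000001", "50000000")
print(abs(tp - tq))   # ~0.09: far apart on the line
```

The same construction also sends some distant plane points to nearby line points, so neither spatial relation survives the transformation.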
References

Girosi, F., and Poggio, T. 1989. Representation properties of networks: Kolmogorov's theorem is irrelevant. Neural Comp. 1, 465-469.
Hecht-Nielsen, R. 1987. Kolmogorov's mapping neural network existence theorem. In Proceedings of the International Conference on Neural Networks, Vol. III, pp. 11-14. IEEE.
Kolmogorov, A. N. 1957. On the representation of continuous functions of several variables in the form of a superposition of continuous functions of one variable and addition. Dokl. Akad. Nauk USSR 114(5), 953-956.
Kurkova, V. 1991. Kolmogorov's theorem is relevant. Neural Comp. 3, 617-622.
Lin, J.-N., and Unbehauen, R. 1992. A simplified model for understanding the Kolmogorov network. In Proceedings URSI Int. Symp. on Signals, Systems and Electronics, ISSSE 92, Paris, 11-14.
Lippmann, R. P. 1987. An introduction to computing with neural nets. IEEE ASSP Mag. 4(2), 4-22.
Lorentz, G. G. 1966. Approximation of Functions. Holt, Rinehart & Winston, New York.

Received 6 February 1992; accepted 5 June 1992.
Communicated by Lawrence Abbott
Statistical Mechanics for a Network of Spiking Neurons
Leonid Kruglyak* William Bialek
Department of Physics, and Department of Molecular and Cell Biology, University of California at Berkeley, Berkeley, CA 94720 USA, and NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 USA
We show that a simple statistical mechanics model can capture the collective behavior of large networks of spiking neurons. Qualitative arguments suggest that regularly firing neurons should be described by a planar "spin" of unit length. We extract these spins from spike trains and then measure the interaction Hamiltonian using simulations of small clusters of cells. Correlations among spike trains obtained from simulations of large arrays of cells are in quantitative agreement with the predictions from these Hamiltonians. We comment on the novel computational abilities of these "XY networks."
1 Introduction

Understanding the computations performed by biological nervous systems requires methods for describing the collective behavior of large networks of neurons. For a physicist, the natural tool is statistical mechanics. The development of the Hopfield model (Hopfield 1982) and its relation to Ising spin glasses have resulted in many recent efforts along these lines (Amit 1989). While these neural network studies have produced a number of interesting results, they have been based on extremely simplified models of neurons, often just two-state devices. The hope has been that most microscopic biological details are not important for the collective computational behavior of the system as a whole. Renormalization group ideas have taught us that microscopic details are often irrelevant, but in the neural case it is unclear exactly where to draw the line between important features and incidental details. Most real neurons produce trains of identical action potentials, or spikes, and it is the timing of these spikes that carries information; in many cases significant computations are carried out on time scales comparable to the interspike intervals (de Ruyter van Steveninck and Bialek 1988; Bialek et al. 1991).

*Present address: Theoretical Physics, Oxford University, 1 Keble Rd., Oxford OX1 3NP.
Neural Computation 5, 21-31 (1993)
© 1993 Massachusetts Institute of Technology
How do we relate the observable sequences of spikes to the local spins or fields in a statistical mechanics model? In this paper we begin with a semirealistic model for spiking neurons in a network and systematically construct a corresponding statistical mechanics model for interacting spins. This model is successful in that it accurately describes the correlations among spike trains observed in simulations of large arrays of interconnected neurons. The construction is largely numerical, and while we can offer analytical explanations only for some of its successes, we believe that the numerical results themselves are of considerable interest. We emphasize that the goal of this work is not to model a particular part of the nervous system but rather to show that an explicit reduction of a network of spiking cells to a statistical mechanics spin system is possible.
2 A Model for Spiking Cells
We use the FitzHugh-Nagumo (FN) model (FitzHugh 1961; Nagumo et al. 1962) to describe the electrical dynamics of an individual neuron. This model demonstrates a threshold for firing action potentials, a refractory period, and single-shot as well as repetitive firing: in short, all the qualitative properties of neural firing. It is also known to provide a reasonable quantitative description of several cell types (Rinzel and Ermentrout 1989; FitzHugh 1969). To be realistic it is essential to inject into each cell a noise current δI_n(t), which we take to be gaussian, spectrally white, and independent in each cell n. We model a synapse between two neurons by exponentiating the voltage from one and injecting it as current into the other. Our choice is motivated by the fact that the number of transmitter vesicles released at a synapse is exponential in the presynaptic voltage (Aidley 1971); other synaptic transfer characteristics, including small delays, give results qualitatively similar to those described here. We emphasize that all of our methods for analyzing the FN model with exponential synapses have been applied to other models of spike generation with essentially identical results (Kruglyak 1990). In particular, both a simpler "integrate and fire" model (Gerstner 1991) and a more complicated channel model (Rinzel and Ermentrout 1989; Morris and Lecar 1981), in which regular bursts of spikes play the same role as the spikes do in the FN model, were examined in detail.
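As a concrete illustration, a single noisy FN cell can be integrated with the Euler-Maruyama method. The parameter values below are illustrative textbook choices that place the cell in the repetitively firing regime; they are not the values used in the paper (which works in a noise-driven, nonoscillating regime), and the cubic form of the equations is the standard one rather than the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative FitzHugh-Nagumo parameters (oscillatory regime), not the paper's
a, b, eps, I0, noise = 0.7, 0.8, 0.08, 0.5, 0.04
dt, T = 0.05, 40_000          # time step and number of integration steps

v, w = -1.0, -0.5
spikes, above = [], False
for t in range(T):
    # dv/dt = v - v^3/3 - w + I0 + noise;  dw/dt = eps*(v + a - b*w)
    dv = v - v**3 / 3 - w + I0 + noise * rng.standard_normal() / np.sqrt(dt)
    dw = eps * (v + a - b * w)
    v += dt * dv
    w += dt * dw
    # crude spike detection: upward crossing of v = 1
    if v > 1.0 and not above:
        spikes.append(t * dt)
        above = True
    elif v < 0.0:
        above = False

isi = np.diff(spikes)
print(len(spikes), isi.mean(), isi.std())
```

With weak noise the interspike-interval distribution is tightly bunched around its mean, the "regular firing" case studied in the text; increasing the noise and moving the bias current out of the oscillatory window pushes the cell toward irregular, roughly Poisson firing.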
τ₁ dV_n/dt = V_n(V_n − a)(1 − V_n) − W_n + I₀ + Σ_m J_nm exp(V_m/V₀) + δI_n(t)
τ₂ dW_n/dt = V_n − W_n    (2.1)
where V_n is the transmembrane voltage in cell n, I₀ is the dc bias current, and the W_n are auxiliary variables; V₀ sets the scale of voltage sensitivity in the synapse. Voltages and currents are dimensionless, and the parameters of the system are expressed in terms of the time constants τ₁ and τ₂ and a dimensionless ratio a. The FN model with noise has the standard sigmoidal input/output relation if one plots the firing rate vs. dc injected current in a single cell.¹ Most neural network models take this simple relation to be the central feature of neural firing. However, the sigmoidal i/o relation hides the experimentally salient distinction between regular and irregular patterns of firing. Regular firing is characterized by a tightly bunched interspike interval distribution; in the irregular case the distribution is approximately Poisson. These different regimes correspond to different parameter values in the model, and experiments on sensory neurons suggest that many cells are confined to one regime or the other under natural conditions (Teich and Khanna 1985; Goldberg and Fernandez 1971). Networks of neurons in the two regimes should exhibit very different collective properties. We study in detail the case of regular firing. If the intervals between spikes cluster tightly about a mean value, the firing is nearly periodic, and we expect that it should be describable by an underlying oscillation. Note that we are not talking about a perturbed oscillation of the noiseless FN equations; in fact, the simulations described below are carried out in a parameter region where those equations do not oscillate: the oscillations we see are noise-driven. Also, when cells are coupled together the interactions cannot be treated as a small effect, since a single spike from one cell can trigger a spike in another cell.
Hence the standard analytic methods for reducing complex nonlinear oscillations to phase equations (Kuramoto 1984) do not apply, and we have to follow our intuition and attempt to extract the underlying oscillation numerically.
3 From Spike Trains to Spins
We assume that spikes occur at times t_i when the total phase of an oscillation crosses zero (mod 2π). Then this oscillation has a mean frequency ω₀ that is simply related to the mean interspike interval, and a slowly varying phase φ(t) that describes the deviations from perfect periodicity. If this description is correct, the spike train s(t) = Σ_i δ(t − t_i) should have a power spectrum with well resolved peaks at ±ω₀, ±2ω₀, .... This is what we observe in simulations of the FN model for one isolated cell.

¹The noiseless behavior is quite different: there is a threshold current below which the rate is zero and above which it jumps to a finite value. There is also a second threshold current above which the cell ceases to fire.
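This spectral signature is easy to check numerically. The sketch below substitutes a synthetic, nearly periodic spike train for the FN simulation (an assumption of ours, chosen so the answer is known): the binned spike train's power spectrum should peak at the mean firing frequency.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic, nearly periodic spike train: mean interval 1.0, small jitter
isi = 1.0 + 0.05 * rng.standard_normal(512)
times = np.cumsum(isi)

dt = 0.01
n = int(times[-1] / dt) + 1
s = np.zeros(n)
s[(times / dt).astype(int)] = 1.0 / dt        # s(t) = sum_i delta(t - t_i), binned

power = np.abs(np.fft.rfft(s - s.mean()))**2
freqs = np.fft.rfftfreq(n, dt)

# locate the spectral peak, ignoring the very-low-frequency part
mask = freqs > 0.2
f_peak = freqs[mask][np.argmax(power[mask])]
print(f_peak)    # should sit near the mean firing frequency, ~1.0
```

For a regularly firing cell the fundamental peak is narrow and tall; as the jitter grows, the peaks broaden (the higher harmonics first) and the underlying-oscillation description degrades.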
We then low-pass filter s(t) to keep only the ±ω₀ peaks, obtaining a phase- and amplitude-modulated cosine,

[Fs](t) ≈ A(t) cos[ω₀t + φ(t)]    (3.1)
where [Fs](t) denotes the filtered spike train, and the amplitude A(t) and the phase φ(t) vary slowly with time. Plotting the filtered spike train against its time derivative results in a phase portrait characteristic of a noisy sinusoidal oscillation, confirming equation 3.1. As indicated earlier, very similar phase portraits are obtained for the Gerstner and bursting Morris-Lecar models (Kruglyak 1990). Hence the spike train can be described by a two-vector in the phase plane. We do not expect the magnitude of this vector to matter, since A(t) is related to the unimportant details of the filtering process and to the biologically irrelevant differences among spikes. The orientation of the vector, now assumed to be of unit length, gives us the phase. Using the phase portrait it is thus possible to process the spike train from a neuron and recover a time-dependent, planar unit spin S(t). We now want to see how these spins interact when we connect two cells via synapses.

4 From Synapses to Spin-Spin Interactions
We characterize the two-neuron interaction by accumulating a histogram of the phase differences between two neurons connected via a model synapse.² A variety of synaptic interactions have been examined; the results below, though not the exact form of the interaction, hold in every case. The probability distribution of the phase difference defines an effective Hamiltonian, P(φ₁, φ₂) ∝ exp[−H(φ₁ − φ₂)]. Note that this Hamiltonian is simply another way to characterize the equilibrium distribution; it is not meant to describe the time evolution of the system. Hence it remains a useful concept even when the standard notion of an energy function breaks down. This is in contrast with the usual statistical mechanics approach to neural networks, which assumes a Liapunov dynamics for the noiseless case and then treats all noise by promoting the Liapunov function to the role of a true Hamiltonian and placing it at finite temperature. The assumption of Liapunov dynamics cannot be justified for biological networks, and the noise can be more complicated (Crair and Bialek 1990). Figure 1 shows the effective Hamiltonian for a pair of symmetrically connected cells. We see that with excitatory synapses (J > 0) the interaction is ferromagnetic, as expected. Once again, the Hamiltonians for the Gerstner and bursting Morris-Lecar models show only minor variations from the FN Hamiltonian (Kruglyak 1990).

²An interaction that depends only on the phase difference is only the simplest case; there could in principle be a dependence on the absolute phase of one of the cells as well. As mentioned below and described in detail elsewhere, no such dependence is seen when we look at probability distributions of phases in small clusters of cells.
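The construction H = −log P from a phase-difference histogram can be sketched as follows. For the sake of a checkable example, the phase differences here are drawn from an assumed von Mises distribution P(Δφ) ∝ exp(K cos Δφ) rather than measured from spike trains, so the recovered Hamiltonian can be compared with the known answer K(1 − cos Δφ).

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for measured phase differences: samples from P ~ exp(K cos dphi)
K_true = 2.0
dphi = rng.vonmises(0.0, K_true, size=200_000)

# Histogram the phase differences and read off H = -log P (up to a constant)
counts, edges = np.histogram(dphi, bins=60, range=(-np.pi, np.pi), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
H = -np.log(counts)
H -= H.min()                     # fix the arbitrary additive constant

# For this distribution, H(dphi) should equal K*(1 - cos dphi)
K_fit = np.polyfit(1.0 - np.cos(centers), H, 1)[0]
print(K_fit)                     # recovers a value close to K_true
```

Applied to real spike trains the same recipe needs enough samples that no histogram bin is empty, since H diverges wherever P is estimated as zero.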
Network of Spiking Neurons
Figure 1: Effective Hamiltonians for three values of the coupling strength J. From simulations of equation 3.1 for two cells with τ₁ = 0.1, Q = 10, I₀ = −0.25, V₀ = 0.3, a = 1.1, and noise δI of spectral density S = 1.25 x With arrays of more than two neurons it is possible that the effective Hamiltonian includes more than just a simple nearest-neighbor interaction. We have searched for these effects in simulations on small clusters of cells, and always find that the observed phase histograms can be accounted for by appropriate convolutions of the histograms found in the two-neuron simulations. This leads us to predict that the statistical mechanics of an entire network will be described by the effective Hamiltonian H = \sum_{ij} H_{ij}(\phi_i - \phi_j), where H_{ij}(\phi_i - \phi_j) is the effective Hamiltonian measured for a pair of connected cells i, j as in Figure 1. 5 Correlation Functions
One crucial consequence of equation 3.1 is that correlations of the filtered spike trains are exactly proportional to the spin-spin correlations that are the natural objects in statistical mechanics. Specifically, if we have two cells n and m,

\langle S_n \cdot S_m \rangle = \langle \cos(\phi_n - \phi_m) \rangle = A_n^{-1} A_m^{-1} \langle [Fs_n](t)\,[Fs_m](t) \rangle   (5.1)
This relation shows us how the statistical description of the network can be tested in experiments that monitor actual neural spike trains.
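A quick numerical check of this proportionality, using idealized sinusoidal filtered spike trains with a fixed phase offset (the amplitude and frequency values below are arbitrary choices):

```python
import numpy as np

# For sinusoidal filtered trains [Fs](t) = A*cos(omega*t + phi), the
# time-averaged product over whole periods is (A^2/2)*cos(phi_n - phi_m),
# i.e., proportional to the spin-spin correlation <S_n . S_m>.
t = np.linspace(0.0, 200.0 * np.pi, 200_000, endpoint=False)  # 100 periods
A, omega = 0.7, 1.0                        # arbitrary amplitude and frequency

for delta in (0.0, np.pi / 3, np.pi / 2, 2.5):
    fs_n = A * np.cos(omega * t)
    fs_m = A * np.cos(omega * t + delta)
    product = np.mean(fs_n * fs_m)         # <[Fs_n](t) [Fs_m](t)>
    spin_corr = np.cos(delta)              # <S_n . S_m> for fixed phases
    assert abs(product - 0.5 * A**2 * spin_corr) < 1e-6
```

The amplitude-dependent factor is what the prefactor in equation 5.1 removes, leaving the pure spin-spin correlation.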
Leonid Kruglyak and William Bialek
5.1 One Dimension. It is well known that when planar spins are connected in a one-dimensional chain with nearest-neighbor interactions, correlations between spins drop off exponentially with distance. The mapping from spike trains to spins predicts that this exponential decay will also be observed in the spike train correlation functions. To test this prediction we have run simulations on chains of 32 Fitzhugh-Nagumo neurons connected to their nearest neighbors. Correlations computed directly from the filtered spike trains as indicated above indeed decay exponentially, as seen in Figure 2. More complicated correlation functions are predicted and observed with, for example, delayed or asymmetric synaptic interactions (Kruglyak 1990). In Figure 3 we compare the correlation lengths predicted from the effective Hamiltonians determined as described above with the correlation lengths extracted from the realistic simulations; the agreement is excellent. We emphasize that while the “theoretical” correlation lengths are based on the statistical mechanics of a simple spin model, the “experimental” correlation lengths are based on measured correlations among spike trains generated by realistic neurons. There are no free parameters.
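The exponential decay in a nearest-neighbor chain can be reproduced in a few lines: with free boundaries, the bond angle differences of a planar-spin chain are independent, so the chain can be sampled directly and the measured decay compared with the transfer-matrix prediction. The coupling value is arbitrary; this is a sketch of the statistical-mechanics result, not of the Fitzhugh-Nagumo simulations themselves.

```python
import numpy as np

rng = np.random.default_rng(1)

# Nearest-neighbor chain of planar spins at coupling K = beta*J.  With free
# boundaries the bond differences d are independent, p(d) ∝ exp(K cos d),
# so <cos(phi_0 - phi_r)> = r1**r with r1 = <cos d> per bond: exponential
# decay with correlation length xi = -1/ln(r1).
K = 1.0                                        # arbitrary coupling

# Per-bond factor r1 by quadrature on a periodic grid
d = -np.pi + 2 * np.pi * np.arange(4096) / 4096
w = np.exp(K * np.cos(d))
r1 = np.sum(np.cos(d) * w) / np.sum(w)
xi = -1.0 / np.log(r1)                         # predicted correlation length

# Sample bond differences by rejection and accumulate phases along the chain
n_bonds, n_chains = 12, 20_000
u = rng.uniform(-np.pi, np.pi, (n_chains, n_bonds))
acc = rng.uniform(0, 1, u.shape) < np.exp(K * (np.cos(u) - 1))
while not acc.all():                           # resample rejected bonds
    m = ~acc
    u[m] = rng.uniform(-np.pi, np.pi, m.sum())
    acc[m] = rng.uniform(0, 1, m.sum()) < np.exp(K * (np.cos(u[m]) - 1))
phases = np.cumsum(u, axis=1)                  # phi_r - phi_0 over r bonds

for r in (1, 3, 6):
    c = np.mean(np.cos(phases[:, r - 1]))      # measured correlation
    assert abs(c - r1**r) < 0.03               # matches exp(-r/xi)
```

This is the analog of the best-fit exponentials in Figure 2, with the correlation length fixed by the single-bond distribution rather than fitted.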
Figure 2: Correlation function in 1 dimension for three values of the coupling strength (lowest curve, J = 0.0005; middle curve, J = 0.001; highest curve, J = 0.0015). The points are obtained from simulations of chains of neurons; the lines are best-fit exponentials.
Network of Spiking Neurons
27
Figure 3: Correlation length obtained from fits to the simulation data vs. correlation length predicted from the Hamiltonians. 5.2 Two Dimensions. In the two-dimensional case we connect each neuron to its four nearest neighbors on a square lattice. The corresponding spin model is essentially the XY model, but from Figure 1 we see that the interaction potential between neighbor spins has a somewhat different form, with the energy rapidly reaching a plateau as the spins tilt apart. The XY model itself is of course well understood (Kosterlitz and Thouless 1973; Nelson 1983), and models with potentials as in Figure 1 should be in the XY universality class. To get a feeling for the relation between the Hamiltonian in Figure 1 and the XY model we follow José et al. (1977) and carry out a Migdal-Kadanoff renormalization. We recall that for the XY model itself this approximation leads to an "almost" fixed line at low temperatures, while above the transition the flow is to infinite temperature. We find the same results in our model; the "almost" fixed line is the same, and the flow to infinite temperature is asymptotically along the same curve as in the XY case. The similarity of flows in the XY model and in our model is so great that we feel confident in predicting the conditions under which the network should exhibit algebraic or exponential decay of correlations. We then check these predictions directly by simulating two-dimensional arrays of neurons with toroidal boundary conditions. Figure 4 shows
Figure 4: Correlation function in 2 dimensions. (a) Above the transition, log-linear plot; linear behavior indicates exponential fall-off. (b) Below the transition, log-log plot; linear behavior indicates algebraic fall-off. the spike train correlations as a function of distance for two coupling strengths, one below and one above the transition point suggested by the renormalization calculation. The low coupling (high T) data are taken on a 32 x 32 lattice and are well described by exponential fall-off, at least until the correlations are so small as to be lost in the statistical noise. In the strong coupling (low T) case we study a 128 x 128 lattice and find that
the correlations first decay algebraically and then plateau and become irregular due to finite size effects. In the algebraic phase the long-distance behavior of the system should be describable in terms of spin waves (Kosterlitz and Thouless 1973). We can fit the data of Figure 4b, except at the shortest distances, with an exact lattice spin wave calculation in which the critical exponent η ≈ 0.2. This is reasonable since we are just below the apparent phase transition, where η ≈ 0.25. We have also looked at the effects of several kinds of disorder (deleting a fraction of connections, randomly choosing coupling strengths, randomly assigning the inherent firing frequencies of the cells) and find at most a shift in the transition coupling with no apparent qualitative changes when the disorder is small (Kruglyak 1990; Kruglyak and Bialek 1991b). In particular, the phase with algebraic decay of correlations is preserved, albeit at stronger couplings. This result also holds for the other neural models described in Section 2. This robustness to changes in connectivity, connection strengths, and internal dynamics of neurons gives us confidence that the model discussed in this paper is applicable to real biological systems. 6 Discussion
To summarize, we have found a systematic procedure for extracting spin variables from spike trains for a particular class of realistic model neurons. We then measure an effective interaction Hamiltonian by simulating small clusters of cells. This allows us to formulate a statistical mechanics model for a network of spiking cells by directly referring to a more microscopic model rather than by simply postulating a coarse-grained description. We use correlations between spike trains to test the collective properties of networks predicted from the model and find that the predictions are both qualitatively and quantitatively correct. The fact that our particular network is described by an XY-like model is especially interesting. There are many computational problems, especially in vision, where one would like to make comparisons among signals carried by neurons separated by large physical distances in a given layer of the brain. It has been traditionally assumed that the range of possible comparisons is limited by the range of physical interconnections, which is typically quite short. Physically we know that this is wrong, since we can have long-range correlations near a critical point even when the microscopic interactions are short ranged. The difficulty is that the system must be carefully poised at the critical point, but this is not a problem in XY systems, which have a critical line and hence an entire regime of long-range correlations. Reasonably regular two-dimensional architectures are common in regions of the nervous system devoted to sensory information processing. In many of these systems one can observe neural responses to stimuli that provide direct input only to very distant neurons (Allman et al. 1985); such responses are described as coming from "beyond the classical receptive field." The power-law decay of correlations in XY-like models may provide an apt description of the gradual decrease in responsiveness to more distant stimuli found in these experiments. If the XY scenario is applicable, we expect that these neurons should be regularly firing, and there is recent evidence for such oscillatory behavior in cortical cells that exhibit long-range responses (Gray and Singer 1989; Gray et al. 1989; Eckhorn et al. 1988). Simulations show that local temporal correlations in such networks can indeed carry information about large-scale spatial properties of the stimulus (Kruglyak 1990; Kruglyak and Bialek 1991a).
Acknowledgments We thank O. Alvarez, D. Arovas, A. B. Bonds, K. Brueckner, M. Crair, E. Knobloch, and H. Lecar for helpful discussions. The work at Berkeley was supported in part by the National Science Foundation through a Presidential Young Investigator Award (to W. B.), supplemented by funds from Cray Research, Sun Microsystems, and the NEC Research Institute, by the Fannie and John Hertz Foundation through a Graduate Fellowship (to L. K.), and by the USPHS through a Biomedical Research Support Grant.
References
Aidley, D. J. 1971. The Physiology of Excitable Cells. Cambridge University Press, Cambridge.
Amit, D. J. 1989. Modeling Brain Function. Cambridge University Press, Cambridge.
Allman, J., Meizin, F., and McGuiness, E. 1985. Stimulus specific responses from beyond the classical receptive field: Neurophysiological mechanisms for local-global comparisons in visual neurons. Annu. Rev. Neurosci. 8, 407.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., and Warland, D. 1991. Reading a neural code. Science 252, 1854.
Crair, M. C., and Bialek, W. 1990. Non-Boltzmann dynamics in networks of spiking neurons. In Advances in Neural Information Processing Systems, 2, D. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
de Ruyter van Steveninck, R. R., and Bialek, W. 1988. Real-time performance of a movement-sensitive neuron in the blowfly visual system: Coding and information transfer in short spike sequences. Proc. R. Soc. Lond. B 234, 379.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121.
Fitzhugh, R. 1961. Impulses and physiological states in theoretical models of nerve membrane. Biophys. J. 1, 445-466.
Fitzhugh, R. 1969. Mathematical models of excitation and propagation in nerve. In Biological Engineering, H. P. Schwan, ed., Chap. 1. McGraw Hill, New York.
Gerstner, W. 1991. Associative memory in a network of "biological" neurons. In Advances in Neural Information Processing Systems, 3, D. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Goldberg, J. M., and Fernandez, C. 1971. Physiology of peripheral neurons innervating semicircular canals of the squirrel monkey. III: Variations among units in their discharge properties. J. Neurophys. 34, 676.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334.
Gray, C. M., and Singer, W. 1989. Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. U.S.A. 86, 1698.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554.
José, J. V., Kadanoff, L. P., Kirkpatrick, S., and Nelson, D. R. 1977. Renormalization, vortices, and symmetry-breaking perturbations in the two-dimensional planar model. Phys. Rev. B 16, 1217.
Kruglyak, L. 1990. From biological reality to simple physical models: Networks of oscillating neurons and the XY model. Ph.D. thesis, University of California at Berkeley, Berkeley, CA.
Kruglyak, L., and Bialek, W. 1991a. Analog computation at a critical point: A novel function for neuronal oscillations? In Advances in Neural Information Processing Systems, 3, D. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Kruglyak, L., and Bialek, W. 1991b. From biological reality to simple physical models: Networks of oscillating neurons and the XY model. In Neural Networks: From Biology to High Energy Physics, O. Benhar, C. Bosio, P. Del Giudice, and E. Tabet, eds. ETS Editrice, Pisa, 1992.
Kosterlitz, J. M., and Thouless, D. J. 1973. Ordering, metastability, and phase transitions in two-dimensional systems. J. Phys. C: Solid State Phys. 6, 1181.
Kuramoto, Y. 1984. Chemical Oscillations, Waves, and Turbulence. Springer, Berlin.
Morris, C., and Lecar, H. 1981. Voltage oscillations in the barnacle giant muscle fiber. Biophys. J. 35, 193-213.
Nagumo, J. S., Arimoto, S., and Yoshizawa, S. 1962. An active pulse transmission line simulating a nerve axon. Proc. IRE 50, 2061.
Nelson, D. R. 1983. Defect-mediated phase transitions. In Phase Transitions and Critical Phenomena, C. Domb and J. L. Lebowitz, eds., Vol. 7, Chap. 1. Academic Press, London.
Rinzel, J., and Ermentrout, G. B. 1989. Analysis of neural excitability and oscillations. In Methods in Neuronal Modeling, C. Koch and I. Segev, eds., Chap. 5. The MIT Press, Cambridge, MA.
Teich, M. C., and Khanna, S. M. 1985. Pulse-number distribution for the neural spike train in the cat's auditory nerve. J. Acoust. Soc. Amer. 77, 1110.
Received 3 September 1991; accepted 29 May 1992.
Communicated by James Anderson
Acetylcholine and Learning in a Cortical Associative Memory Michael E. Hasselmo Department of Psychology, Harvard University, Cambridge, MA 02138 USA
Implementing associative memory function in biologically realistic networks raises difficulties not dealt with in previous associative memory models. In particular, during learning of overlapping input patterns, recall of previously stored patterns can interfere with the learning of new patterns. Most associative memory models avoid this difficulty by ignoring the effect of previously modified connections during learning, thereby clamping activity to the patterns to be learned. Here I propose that the effects of acetylcholine in cortical structures may provide a neurophysiological mechanism for this clamping. Recent brain slice experiments have shown that acetylcholine selectively suppresses excitatory intrinsic fiber synaptic transmission within the olfactory cortex, while leaving excitatory afferent input unaffected. In a computational model of olfactory cortex, this selective suppression, applied during learning, prevents interference from previously stored patterns during the learning of new patterns. Analysis of the model shows that the amount of suppression necessary to prevent interference depends on cortical parameters such as inhibition and the threshold of synaptic modification, as well as input parameters such as the amount of overlap between the patterns being stored. 1 Introduction
A wide range of neural network models have suggested that associative memory function may depend on excitatory intrinsic connections within the cortex. These include both linear associative matrix memories (Anderson 1983; Kohonen 1984) and models related to spin glass systems (Hopfield 1982). However, the majority of these models have focused on network dynamics during the recall of previously stored memories. During learning of new memories, most associative memory models ignore the effect of intrinsic connections within the network by clamping the activity of units to the desired pattern, by computing synaptic modification independently of network dynamics, or by applying learning before the spread of activation. This allows use of the Hebb rule for computation Neural Computation 5, 32-44 (1993)
© 1993 Massachusetts Institute of Technology
of a clean outer product for each input pattern being stored. The sum of these outer products is computed for m different patterns and stored as an intrinsic excitatory connectivity matrix B_{ij} as follows, where A_i^{(p)} represents element i of pattern p:

B_{ij} = \sum_{p=1}^{m} A_i^{(p)} A_j^{(p)}
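In matrix form this storage step is just a sum of outer products; a minimal sketch with hypothetical random binary patterns:

```python
import numpy as np

# Hebbian storage with clamped activity: B_ij = sum_p A_i^(p) A_j^(p).
rng = np.random.default_rng(0)
n, m = 100, 5                                # network size, number of patterns
A = rng.integers(0, 2, size=(m, n)).astype(float)

B = np.zeros((n, n))
for p in range(m):
    B += np.outer(A[p], A[p])                # outer product of pattern p

assert np.allclose(B, A.T @ A)               # equivalent one-line matrix form
```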
Despite the common use of clamping in associative memory models, no neurophysiological mechanism has previously been presented describing how the brain might suppress normal network recall dynamics during learning. Without clamping, and with learning rates that are similar to or slower than the update of activation, recall of previously stored patterns will interfere with the learning of new patterns. This paper presents a neurophysiological mechanism that may prevent interference during learning in a cortical associative memory. 2 The Problem: Interference during Learning
The learning rule in associative memory models is taken to be analogous to the phenomenon of long-term potentiation within cortical structures. However, biological evidence suggests that long-term potentiation cannot occur until the presynaptic activity (a_j) has reached the terminal bouton and influenced postsynaptic activity (a_i). Thus, associative memory models should not apply a learning rule until activity has propagated across the synapse being modified. Rather than applying the learning rule immediately, before the spread of activity, as in \Delta B_{ij}(t) = a_i(t) a_j(t), learning should be applied only after synaptic transmission has had time to influence the postsynaptic activity, as in \Delta B_{ij}(t+1) = a_i(t+1) a_j(t). This presents a difficulty for associative memory models. Unless activity is clamped to the desired pattern during learning, activity will be influenced by intrinsic connections modified by previously stored patterns. Thus, when a new pattern is presented to the network, the activity will depend partly on previously learned patterns that overlap with the new pattern. For instance, after learning of one pattern A^{(1)}, the connectivity matrix B_{ik} = A_i^{(1)} A_k^{(1)}. In this case, presentation of a second pattern A^{(2)} will result in postsynaptic activity a_i combining the input pattern and the spread of activation along excitatory intrinsic connections:

a_i(t+1) = A_i^{(2)} + \sum_{k=1}^{n} B_{ik} A_k^{(2)} = A_i^{(2)} + A_i^{(1)} \sum_{k=1}^{n} A_k^{(1)} A_k^{(2)}

where n = the number of neurons in the network. If this activity is taken as the postsynaptic activity in the Hebbian learning rule, and presynaptic
activity is taken as a_j = A_j^{(2)}, the modification of synaptic strength takes the form:

\Delta B_{ij}(t+1) = a_i(t+1)\, a_j(t) = A_i^{(2)} A_j^{(2)} + A_i^{(1)} A_j^{(2)} \left( \sum_{k=1}^{n} A_k^{(1)} A_k^{(2)} \right)

Thus, in addition to the outer product of the pattern with itself, A_i^{(2)} A_j^{(2)}, the connectivity matrix also contains the outer product between the two patterns stored, A_i^{(1)} A_j^{(2)}, scaled to the dot product between these two patterns. For orthogonal patterns, the dot product equals zero, and no interference during learning occurs. However, for nonorthogonal patterns, interference during learning will occur between each new pattern p and all previously stored patterns q according to their direct overlap, and also according to the overlap between these patterns and intervening patterns stored within the network. This adds an interference term to the learning rule as follows:
\Delta B_{ij}^{(p)} = A_i^{(p)} A_j^{(p)} + \sum_{q_1=1}^{p-1} \sum_{q_2=1}^{p-1} A_i^{(q_1)} A_j^{(p)} \left( \sum_{k=1}^{n} A_k^{(q_1)} A_k^{(q_2)} \right) \left( \sum_{k=1}^{n} A_k^{(q_2)} A_k^{(p)} \right)

Coupled with the effect of each connection enhancing its own growth, this can lead to a positive feedback cycle, resulting in runaway synaptic modification. In simulations, if there is no set of patterns completely orthogonal to other sets of patterns, if all patterns are trained in parallel, and if the strength of connections saturates at some value, this interference can ultimately result in a connectivity matrix more closely resembling

B_{ij} = \sum_{p=1}^{m} \sum_{q=1}^{m} A_i^{(p)} A_j^{(q)}
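The two-pattern interference term derived above is easy to verify numerically; the sketch below uses hypothetical random binary patterns and the unclamped Hebbian update:

```python
import numpy as np

# After storing pattern A1, present pattern A2 without clamping: the
# postsynaptic activity mixes in A1 through the modified connections, and
# the Hebbian update acquires the cross term (A1 . A2) * A1 A2^T.
rng = np.random.default_rng(3)
n = 40
A1 = (rng.random(n) < 0.3).astype(float)
A2 = (rng.random(n) < 0.3).astype(float)

B = np.outer(A1, A1)                         # connectivity after pattern 1
a_post = A2 + B @ A2                         # activity when pattern 2 arrives
dB = np.outer(a_post, A2)                    # unclamped Hebbian update

# Matches the two-pattern interference expression in the text exactly
expected = np.outer(A2, A2) + (A1 @ A2) * np.outer(A1, A2)
assert np.allclose(dB, expected)
```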
This is useless for associative memory storage, since presentation of any one memory will recall elements of all memories stored within the network. While described here for linear associative memories, the effect of interference during learning also appears in associative memory models based on the spin-glass analogy (Hasselmo et al. 1991). In particular, interference will be even more severe if activation is updated for several time steps during learning. Modifications of the Hebbian learning rule have been used to decrease the level of interference between memories during recall, but unless they change the normal dynamics of the network during learning, they cannot prevent initial interference between
overlapping memories during learning. Thus, the covariance learning rule (Sejnowski and Stanton 1990) or learning rules incorporating decay (Kohonen 1984) can gradually decrease interference during parallel learning of all memories, but during sequential learning of each memory independently, interference will occur at least in the initial stages of learning a new pattern. Thus, some neurophysiological mechanism for clamping cortical activity to the input pattern would be necessary to completely prevent interference during learning in a cortical associative memory.

Figure 1: Experimental results showing selective cholinergic suppression of intrinsic fiber synaptic transmission within the piriform cortex. Cholinergic agonists such as carbachol have little effect on synaptic potentials elicited by stimulation of afferent fibers arriving from the olfactory bulb, but strongly suppress synaptic potentials elicited by stimulation of intrinsic fibers arising from other pyramidal cells within the cortex.

3 A Solution: Suppression of Synaptic Transmission by Acetylcholine
I have recently developed a model of a neurophysiological mechanism for the clamping of network activity to the input pattern during learning. The initial motivation for this model arose from experiments in brain slice preparations of piriform (olfactory) cortex (Hasselmo and Bower 1992), as illustrated in Figure 1. Piriform cortex contains clear laminar segregation of excitatory afferent and intrinsic fiber synapses, allowing selective study of synaptic transmission at these two sets of synapses. I have found that acetylcholine and cholinergic agonists such as carbachol selectively suppress synaptic potentials evoked by stimulation of the intrinsic fibers arising from other cortical pyramidal cells, while having almost no effect on synaptic potentials elicited by stimulation of afferent fibers from the olfactory bulb. This effect appears to be due to activation of presynaptic muscarinic cholinergic receptors (Hasselmo and Bower 1992). If applied during the learning of new patterns, this selective suppression of intrinsic fiber synaptic transmission can prevent interference between memories during learning. In the computational framework presented above, cholinergic suppression during learning would be comparable to homogeneously decreasing the strength of the intrinsic connectivity matrix B. In the following equations, this suppression of intrinsic fiber synaptic transmission will be represented by the coefficient c. Setting c = 1.0 indicates maximal suppression of synaptic transmission and c = 0 indicates no suppression. Thus, the learning rule takes the form

\Delta B_{ij}(t+1) = a_i(t+1)\, a_j(t) = \left[ A_i^{(p)} + (1-c) \sum_{k=1}^{n} B_{ik} A_k^{(p)} \right] A_j^{(p)}
In the above equation, setting c = 1.0 prevents interference during learning, and allows the clean computation of an outer product of the input pattern with itself, A_i^{(p)} A_j^{(p)}. However, physiological evidence argues against a complete suppression of synaptic transmission during learning. First, even at high doses of cholinergic agonists, the suppression of synaptic transmission has a mean value of about 70% (Hasselmo and Bower 1992). Second, any associative memory function in this model depends on synaptic modification at the intrinsic fiber synapses that are being suppressed. As noted before, most theories of synaptic modification assume dependence on synaptic transmission at the synapses being modified. Thus, it appears unlikely that synaptic modification could occur if synaptic transmission were completely suppressed at intrinsic fiber synapses. However, as described in the next section, simulations of associative memory function in a model of olfactory cortex show that interference during learning can be prevented with submaximal suppression of synaptic transmission. This work is summarized below. 4 Computational Model of Olfactory Cortex
I have studied the effect of this cholinergic modulation in a model of piriform cortex as an associative memory network, as shown in Figure 2 (Hasselmo et al. 1991, 1992). In the model of piriform cortex, the initial input of pattern p sets the neuron activity a_k(t) = A_k^{(p)}, and the activation
Figure 2: Schematic diagram of the computational model of piriform cortex. The model incorporates afferent fiber input from the olfactory bulb as well as excitatory intrinsic connections arising from other pyramidal cells. In addition, the model incorporates the effects of inhibitory interneurons mediating feedback inhibition. The output of the model is shown during recall in response to a degraded version of a previously learned input pattern (larger black boxes and darker shading of neurons represent greater activity). Neuron activity spreads along previously strengthened connections to complete missing elements of the input pattern.
of each neuron i is synchronously updated for one time step according to the equation

a_i(t+1) = A_i^{(p)} + \sum_{k=1}^{n} (1-c)\, B_{ik}\, g[a_k(t) - p] - \sum_{k=1}^{n} H_{ik}\, g[a_k(t) - p]

where a = neuron activation, A = the afferent input pattern, B = the intrinsic connectivity matrix, H = the feedback inhibition connectivity
matrix, p = the output threshold, c = the level of cholinergic suppression, and the input/output function g[a_k(t) - p] = tanh[a_k(t) - p]. For activation levels below p, output was zero. For levels above p, output increased to an asymptotic value usually set at 1.0. During recall, c is set at 0 (no suppression), while during learning c is set to values between 0 and 1.0, with 1.0 representing maximal suppression (no synaptic transmission). Learning occurs after one time step using a common adaptation of the Hebb rule reflecting direct dependence on postsynaptic depolarization, with Ω = the postsynaptic threshold of synaptic modification, and η = the coefficient of learning:

\Delta B_{ij}(t+1) = \eta\, g[a_i(t+1) - \Omega]\, g[a_j(t) - p]
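A minimal runnable sketch of this learn-then-recall scheme (one-step activation update with suppression c, then a thresholded Hebbian update); all parameter values below are hypothetical illustrations, not the paper's settings:

```python
import numpy as np

# Minimal sketch of the model dynamics: one-step synchronous update with
# cholinergic suppression c on the intrinsic matrix B, followed by a
# Hebbian update thresholded at Omega.  Parameters are illustrative.
rng = np.random.default_rng(11)
n = 30
p_thr, Omega, eta = 0.0, 0.2, 0.1            # output and learning thresholds
H = np.full((n, n), 0.05 / n)                # homogeneous feedback inhibition

def g(x):
    return np.where(x > 0, np.tanh(x), 0.0)  # zero below threshold, tanh above

def learn_step(B, A_p, c):
    a_prev = A_p                             # input sets the initial activity
    a_next = A_p + (1 - c) * (B @ g(a_prev - p_thr)) - H @ g(a_prev - p_thr)
    return B + eta * np.outer(g(a_next - Omega), g(a_prev - p_thr))

patterns = (rng.random((3, n)) < 0.3).astype(float)
B = np.zeros((n, n))
for _ in range(20):                          # repeated presentations
    for A_p in patterns:
        B = learn_step(B, A_p, c=0.7)        # learning with 70% suppression

# Recall (c = 0): activity spreads along the strengthened connections
recall = patterns[0] + B @ g(patterns[0] - p_thr) - H @ g(patterns[0] - p_thr)
assert np.all(recall[patterns[0] > 0] > 0)   # stored units stay active
```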
In the computational model of piriform cortex, the selective suppression of intrinsic fiber synaptic transmission, applied during learning, prevents interference between overlapping input patterns being stored (Hasselmo et al. 1991, 1992). The effects of acetylcholine can be seen clearly if we consider the recall of the model at different stages of learning, as shown in Figure 3. This figure shows the recall of the model in response to a degraded version (R) of a learned pattern (L) across 50 cycles of learning. Five overlapping input patterns are being stored within the network, but the figure shows recall of only one pattern. As seen on the top, without cholinergic suppression during learning, previously stored patterns interfere with the storage of new patterns. This interference has a positive feedback effect, which very rapidly causes the network to respond the same to all input patterns, with a broad, homogeneous level of activity containing components of all learned patterns. On the other hand, with 70% cholinergic suppression implemented in the model, previously stored patterns do not interfere with the storage of other patterns. Thus, the model learns to respond to the degraded input (R) with the full learned version of that input (L), completing the missing input lines. Some interference does appear during recall, but this is far more limited than what occurs without the effects of acetylcholine. 5 Cholinergic Suppression in a Linear Associative Memory
The amount of cholinergic suppression of intrinsic fiber synaptic transmission necessary to prevent interference during learning can be determined with an analytical description assuming linear associative memory properties. The starting point of this description is obtained by combining the activation equation and the learning rule for the model of olfactory cortex. In this case, the synaptic modification after one time step will take the form

\Delta B_{ij}(t+1) = \eta\, g\!\left[ A_i^{(p)} + \sum_{k=1}^{n} (1-c)\, B_{ik}\, g[a_k(t) - p] - \sum_{k=1}^{n} H_{ik}\, g[a_k(t) - p] - \Omega \right] g[a_j(t) - p]
Figure 3: The recall of the network is shown at different stages of learning without cholinergic suppression (0% suppression) and with 70% cholinergic suppression. Five overlapping patterns are being stored within the network, but recall is shown for only one pattern. The learned version of this pattern is shown above the letter L, with six active input lines (black boxes). The version of the pattern presented during recall is shown above the letter R, with only four active input lines. The output of the network in response to the recall pattern is shown after 0 to 50 learning cycles on the right (larger black boxes represent greater output). Without cholinergic suppression, the network eventually starts to respond to the recall pattern with elements of all learned patterns. With 70% cholinergic suppression, the network eventually responds to the recall pattern with completion of the missing components of the learned pattern, showing little interference from other learned patterns.
For purposes of simplification, we will take the input/output function g( ) as linear with slope = 1 and threshold p = 0, and set a_k(t) = A_k^{(p)} and η = 1. In addition, we will assume feedback inhibition is homogeneous and the input vectors all have the same length, so we can represent \sum_{k=1}^{n} H_{ik}\, g[a_k(t) - p] as the constant H. Thus, we obtain
\Delta B_{ij} = \left[ A_i^{(p)} + (1-c) \sum_{k=1}^{n} B_{ik} A_k^{(p)} - H - \Omega \right] A_j^{(p)}
To represent learning of previous memories A^{(q)} in a simplified form, we set the intrinsic connectivity matrix B_{ik} = \sum_{q=1}^{p-1} A_i^{(q)} A_k^{(q)}. Thus, for learning of a new memory p, the intrinsic matrix will be adjusted according to

\Delta B_{ij} = A_i^{(p)} A_j^{(p)} + \left[ (1-c) \sum_{q=1}^{p-1} A_i^{(q)} \left( \sum_{k=1}^{n} A_k^{(q)} A_k^{(p)} \right) - H - \Omega \right] A_j^{(p)}
This contains the clean outer product of pattern p with itself, plus an interference term scaled to the overlap with previously stored patterns q. If we consider a network where the weights grow to a saturating value, sufficient learning will prevent the interference term from distorting storage of individual patterns when it takes values less than zero. However, for values greater than zero, interference will spread between patterns. Thus, the spread of interference during learning between patterns is prevented if

(1-c) \sum_{q=1}^{p-1} A_i^{(q)} \left( \sum_{k=1}^{n} A_k^{(q)} A_k^{(p)} \right) - H - \Omega < 0
For a maximum value of A_i^{(q)} = 1, this condition is satisfied as long as cholinergic suppression is strong enough that

c > 1 - \frac{H + \Omega}{\sum_{q=1}^{p-1} \sum_{k=1}^{n} A_k^{(q)} A_k^{(p)}}
Thus, the level of suppression necessary to prevent interference between two memories decreases with higher levels of inhibition and a higher threshold of synaptic modification. Though this was derived using considerable simplifications of the model, the nonlinear computer simulation of piriform cortex shows qualitatively similar results. As shown in Figure 4, changing the strength of inhibition in the simulation changes the level of cholinergic suppression necessary to prevent interference. Notice from the equations, however, that as the values of inhibition and the threshold of synaptic modification increase, in addition to preventing interference between patterns, they can also directly impede the learning of the new pattern. Thus, the strength of these different parameters must be balanced. Fortunately, the simulation shows that interference can be prevented within a physiologically realistic range for the values of cholinergic suppression, inhibition and the threshold of synaptic modification.
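The threshold just derived can be checked numerically in the simplified linear setting; the parameter values below (H, Ω, pattern density) are hypothetical:

```python
import numpy as np

# In the linear analysis, the cross term A^(q) x A^(p) enters dB with the
# coefficient (1-c)*overlap - H - Omega, so it is negative exactly when
# c > 1 - (H + Omega)/overlap.
rng = np.random.default_rng(7)
n = 100
A_old = (rng.random(n) < 0.3).astype(float)  # previously stored pattern
A_new = (rng.random(n) < 0.3).astype(float)  # pattern now being learned

H, Omega = 1.0, 0.5                          # inhibition, modification threshold
overlap = A_old @ A_new                      # dot product between the patterns
assert overlap > 0                           # overlapping patterns assumed

c_min = 1 - (H + Omega) / overlap            # derived suppression threshold

def interference(c):
    """Coefficient of the A_old x A_new cross term in dB at suppression c."""
    return (1 - c) * overlap - H - Omega

assert interference(min(c_min + 0.1, 1.0)) < 0   # enough suppression: no spread
if c_min >= 0.1:
    assert interference(c_min - 0.1) > 0         # too little: interference spreads
```

As the analysis states, raising H or Ω lowers c_min, while a larger overlap raises it.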
Acetylcholine and Learning
Figure 4: Increasing the strength of inhibition within the model decreases the level of suppression necessary to prevent interference during learning. The recall of the model is shown with the same format as in Figure 3. On the left, the recall of the model is shown across 50 learning cycles at three different levels of cholinergic suppression (c = 0.30, 0.50, and 0.70), with inhibition maintained at a strength of 0.05. On the right, the recall of the model is shown across 50 learning cycles at the same three levels of cholinergic modulation, with inhibition maintained at a strength of 0.10. Note that with stronger inhibition less cholinergic suppression is required to prevent interference during learning.

The analysis presented here also shows that the level of cholinergic suppression necessary to prevent interference increases when there is a greater amount of overlap (larger sum of dot products) between the new pattern p being stored and all other patterns q previously stored in the network. This could be interpreted as a measure of the capacity of the network, beyond which interference during learning will occur.

6 Discussion
In a model of piriform cortex, cholinergic suppression of intrinsic fiber synaptic transmission (Hasselmo and Bower 1992) prevents previously
Michael E. Hasselmo
learned patterns from interfering with the storage of new overlapping patterns. This effect is similar to clamping the activity of the neurons to the desired pattern during learning. Thus, this provides the first description of a neurophysiological mechanism for what has been a standard feature of learning in the majority of associative memory models (Anderson 1983; Kohonen 1984; Hopfield 1982). While the model presented here considers modification of excitatory intrinsic feedback connections within one cortical region, since the learning rule involves an association between presynaptic activity at time t and postsynaptic activity at time t + 1, the results apply equally well to modification of associational connections between different cortical regions receiving separate afferent input. Numerous studies have shown that muscarinic cholinergic antagonists impair memory function in a broad range of behavioral tasks (for a review, see Hagan and Morris 1989). In particular, much evidence supports the suggestion of the model that cholinergic blockade selectively impairs learning, while having less effect on the recall of previously learned information (Ghonheim and Mewaldt 1975). In addition, the prediction of the model that loss of cholinergic suppression should more strongly affect learning of overlapping input patterns is supported by results showing that muscarinic cholinergic antagonists cause greater difficulty in learning tasks with multiple or irrelevant cues, or tasks with less discriminable stimuli (Hagan and Morris 1989). The model presented here generates the prediction that loss of the cholinergic modulation of cortical structures should more strongly affect the learning of memories with overlapping components than the learning of memories without overlapping components. Experimental evidence on other neuropharmacological effects of acetylcholine supports the possibility that acetylcholine clamps cortical activity to the input pattern during learning.
Acetylcholine has postsynaptic effects on cortical neurons, causing increased spiking response to synaptic stimulation and current injection (ffrench-Mullen et al. 1983; Hasselmo and Bower 1992; Hasselmo and Barkai 1992). These effects could enhance the response of cortical neurons to afferent input during the suppression of intrinsic fiber synaptic transmission, helping to clamp neurons to afferent input patterns. In the model presented here, insufficient suppression of synaptic transmission allows interference during learning to cause runaway synaptic modification. This pattern of breakdown may prove a useful model for the initiation and progression of cortical degeneration found in Alzheimer's disease. In Alzheimer's disease, the first regions to show degeneration in the form of neurofibrillary tangles are layer II of entorhinal cortex, region CA1 of the hippocampus, and the subiculum (Hyman et al. 1984). These regions have characteristics that would make them particularly vulnerable to runaway synaptic modification. In particular, the projection from layer II of the entorhinal cortex to the molecular
layer of the dentate gyrus shows strong associative long-term potentiation (McNaughton et al. 1978), but the outer molecular layer does not show cholinergic suppression of synaptic transmission (Kahle and Cotman 1989). In Alzheimer's disease, tangles first appear at the cells of origin of this projection, and strong plaque deposits appear in the terminal field. While the specific causality of the neuropathology found in Alzheimer's disease has not been determined, it is plausible that it could be due to excessive metabolic demands produced by the type of runaway synaptic modification found within the model, or to excitotoxic effects accompanying the excessive strengthening of excitatory synaptic connections. In this case, the analysis described here could explain the selective vulnerability of specific cortical subregions in terms of an imbalance among a range of cortical parameters, including inhibition, the threshold of synaptic modification, cholinergic modulation, and the relative overlap of input patterns arriving from other cortical regions.

Acknowledgments

This work was supported by a grant from the French Foundation for Alzheimer Research. I thank Matt Wilson for development of the simulation graphics, Brooke Anderson for early development of the olfactory cortex simulation, Ross Bergman for current programming support, and Alan Yuille and Fred Waugh for comments on the manuscript.

References

Anderson, J. A. 1983. Cognitive and psychological computation with neural models. IEEE Trans. Systems, Man, Cybern. SMC-13, 799-815.
ffrench-Mullen, J. M. H., Hori, N., Nakanishi, H., Slater, N. T., and Carpenter, D. O. 1983. Asymmetric distribution of acetylcholine receptors and M channels on prepyriform neurons. Cell. Mol. Neurobiol. 3, 163-182.
Ghonheim, M. M., and Mewaldt, S. P. 1975. Effects of diazepam and scopolamine on storage, retrieval, and organizational processes in memory. Psychopharmacologia 44, 257-262.
Hagan, J. J., and Morris, R. G. M. 1989.
The cholinergic hypothesis of memory: A review of animal experiments. In Psychopharmacology of the Aging Nervous System, L. L. Iversen, S. D. Iversen, and S. H. Snyder, eds., pp. 237-324. Plenum Press, New York.
Hasselmo, M. E., and Barkai, E. 1992. Cholinergic modulation of the input/output function of rat piriform cortex pyramidal cells. Soc. Neurosci. Abstr. 18, 521.
Hasselmo, M. E., and Bower, J. M. 1992. Cholinergic suppression specific to intrinsic not afferent fiber synapses in rat piriform (olfactory) cortex. J. Neurophysiol. 67(5), 1222-1229.
Hasselmo, M. E., Anderson, B. P., and Bower, J. M. 1991. Cholinergic modulation may enhance cortical associative memory function. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. Moody, and D. S. Touretzky, eds., pp. 46-52. Morgan Kaufmann, San Mateo, CA.
Hasselmo, M. E., Anderson, B. P., and Bower, J. M. 1992. Cholinergic modulation of cortical associative memory function. J. Neurophysiol. 67(5), 1230-1246.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2559.
Hyman, B. T., Damasio, A. R., Van Hoesen, G. W., and Barnes, C. L. 1984. Cell specific pathology isolates the hippocampal formation in Alzheimer's disease. Science 225, 1168-1170.
Kahle, J. S., and Cotman, C. W. 1989. Carbachol depresses the synaptic responses in the medial but not the lateral perforant path. Brain Res. 482, 159-163.
Kohonen, T. 1984. Self-Organization and Associative Memory. Springer-Verlag, Berlin.
McNaughton, B. L., Douglas, R. M., and Goddard, G. V. 1978. Synaptic enhancement in fascia dentata: Cooperativity among coactive afferents. Brain Res. 157, 277-293.
Sejnowski, T. J., and Stanton, P. K. 1990. Covariance storage in the hippocampus. In An Introduction to Neural and Electronic Networks, S. Zornetzer, J. Davis, and C. Lau, eds., pp. 365-375. Academic Press, San Diego.

Received 10 December 1991; accepted 12 May 1992.
Communicated by David Field
Convergent Algorithm for Sensory Receptive Field Development Joseph J. Atick The Rockefeller University, 1230 York Avenue, New York, NY 10021 USA
A. Norman Redlich School of Natural Sciences, Institute for Advanced Study, Princeton, NJ 08540 USA
An unsupervised developmental algorithm for linear maps is derived which reduces the pixel-entropy (using the measure introduced in previous work) at every update and thus removes pairwise correlations between pixels. Since the measure of pixel-entropy has only a global minimum, the algorithm is guaranteed to converge to the minimum entropy map. Such optimal maps have recently been shown to possess cognitively desirable properties and are likely to be used by the nervous system to organize sensory information. The algorithm derived here turns out to be one proposed by Goodall for pairwise decorrelation. It is biologically plausible since in a neural network implementation it requires only data available locally to a neuron. In training over ensembles of two-dimensional input signals with the same spatial power spectrum as natural scenes, networks develop output neurons with center-surround receptive fields similar to those of ganglion cells in the retina. Some technical issues pertinent to developmental algorithms of this sort, such as "symmetry fixing," are also discussed.

1 Introduction
Recent theoretical results on neural processing support the idea that efficiency of information representation in the sensory pathways could have cognitive advantages (Barlow 1989; Linsker 1988; Atick and Redlich 1990). This is a predictive idea since it leads to the hypothesis that much of the processing in the early stages is geared toward recoding incoming sensory signals into a more efficient form. Starting with natural signals one can assess the efficiency of the sampled representation formed by the array of sensory cells and mathematically derive recodings that would improve the efficiency. These recodings can then be compared with the multistages of neural processing observed.

Neural Computation 5, 45-60 (1993) © 1993 Massachusetts Institute of Technology
One form of efficient representations that figures prominently in the recent literature is the so-called "minimum entropy" one,¹ where the sum of the individual entropies for the elements of the representation (e.g., pixels) is minimal for the ensemble of natural signals (Field 1987; Barlow 1989; Barlow et al. 1989; Atick and Redlich 1990, 1992). This minimum is achieved when there is the least possible statistical dependence between elements. The idea that the nervous system could be engaged in trying to build such a minimum entropy representation of the environment has been tested in the limited context of retinal processing (Atick and Redlich 1992). There it was assumed that the retina, being the first stage in the visual pathway, could reduce pixel-entropy by eliminating no higher than two-point correlations (pairwise correlations). The linear transform on the photoreceptor activities needed to achieve pixel-pixel decorrelation was shown to agree with observed retinal filters, after being careful to take noise into account. In general, the problem of finding entropy reducing maps is very difficult. It is also very unlikely that one will be able to analytically solve for the explicit form of these maps as was done for the pairwise decorrelating map in the retina. An alternate approach is to use neural networks to compute these maps. What one needs are developmental algorithms that iteratively reduce statistical dependence among the elements of the representation as a network is trained over more sensory inputs, and preferably algorithms that are guaranteed to converge to the optimal maps. In this paper we derive a simple developmental algorithm, for the linear class of maps, which we prove lowers pixel-entropy at each learning stage. For this class of maps lowering pixel-entropy is equivalent to decreasing pairwise correlations at each step.
The algorithm turns out to be identical to one originally introduced by Goodall (1960), who was interested in decorrelation in a different context. When introduced, Goodall's algorithm was proven, without reference to an entropy measure, to converge to the solution that pairwise decorrelates. For us this old proof is an independent check of convergence, since our entropy measure has no local minima: our demonstration that the Goodall algorithm successively lowers the entropy is sufficient in itself to prove it converges to the global minimum. This minimum entropy solution is the one that we have previously shown predicts the linear processing observed in the retina, after incorporating noise filtering. One major purpose of this paper is to demonstrate that these ganglion cell receptive fields can also be developed by applying the Goodall algorithm to an ensemble with the same second-order statistical properties (same power spectrum) as natural scenes. Also we expect insight gained from this simple algorithm to be helpful in discovering more complex algorithms capable of producing minimum entropy representations for nonlinear maps that reduce statistical dependence beyond pairwise decorrelation. We should point out that there are several other pairwise decorrelating algorithms in the literature (Kohonen and Oja 1976; Oja 1982; Linsker 1986; Barlow and Foldiak 1989; Foldiak 1989; Sanger 1989; Rubner and Schulten 1990). However, the algorithm that we are presenting here is of particular interest to us since it is proven to reduce our previously introduced entropy measure. Another nice property of this algorithm is its locality, in the sense that all of the data needed by a synapse to modify its strength is available at the input to the neuron. This means that the algorithm is at least plausibly one that might be implemented biologically. By actually attempting here to derive receptive fields based on a "natural" input ensemble we are also forced to face an important issue that arises in implementing similar decorrelation algorithms, but has not to our knowledge been resolved. The problem is that decorrelation itself does not guarantee a unique solution for the receptive fields, and most of the solutions are not localized. This problem can be traced to a large symmetry under which any decorrelating solution remains decorrelating. Here we theoretically analyze the requirements for fixing this symmetry and we find a simple way to do so that leads to localized receptive fields.

¹In previous papers we have referred to this as the minimum redundancy representation, since it eliminates the part of redundancy that is due to statistical dependence among the elements.

2 Entropy Reduction and Convergence
We begin by assuming that the sensory input signal {S_i}, representing the set of photoreceptor responses, is recoded through a linear map K_ij to the set of outputs {O_i}, which in the retina are the ganglion cell responses:

O_i = Σ_j K_ij S_j   (2.1)
As in our previous work (Atick and Redlich 1990, 1992), we introduce an "entropy" measure E{K} that grades different recodings K according to how well they minimize the sum of pixel entropies, without overall loss of information:

E{K} = Tr(K · R · K^T) − log det(K^T · K)   (2.2)
In 2.2, R_ij = (S_i S_j) is the autocorrelator for the input ensemble, with brackets denoting ensemble average; boldface denotes matrices. Minimizing the first term was shown to be equivalent to minimizing the sum of the individual pixel entropies, while the second term acts to enforce reversibility of the map (no information loss). When the measure 2.2 is minimized, δE/δK = 0, one gets a decorrelating solution K satisfying

K · R · K^T = 1   (2.3)
which when convolved with a noise filter was shown to reproduce ganglion cell receptive fields (Atick and Redlich 1992). Although mathematically it is straightforward to find a decorrelating solution K satisfying 2.3, it is not clear how a network of neurons can arrive at such a K. What is needed is a biologically plausible developmental algorithm that can be proven to converge to a K satisfying 2.3. One way to do this is to find a small update δK for the map K that is guaranteed to lower E{K} at each step. This requires that the change in E{K} due to δK must be negative:
δE{K} = Tr( (δE/δK) · δK^T ) < 0   (2.4)
One obvious possibility is to use gradient descent, δK^T = −(δE/δK)^T. However this leads to a nonlocal algorithm that at each update stage requires the computation of an inverse matrix. We propose instead the update

δK = −(1/2) (K · R · K^T − 1) · K   (2.5)
where from 2.2

δE/δK = 2 [K · R − (K^T)^{-1}]   (2.6)
Like gradient descent this update always reduces E{K}, since by 2.4 and 2.5

δE{K} = −Tr( [K · R · K^T − 1] [K · R · K^T − 1]^T )   (2.7)
which is always negative (or zero upon convergence) since Tr(M · M^T) > 0 for any matrix M ≠ 0. (For the mathematically minded reader this shows that E can be thought of as a Lyapunov functional.) The next step is to show that the update 2.5 can be implemented as an algorithm requiring no nonlocal calculations. For this we rewrite 2.5 as an update for K^{-1} instead of K^T, using the fact δK^{-1} = −K^{-1} · δK · K^{-1}. For later convenience, we also change notation now and define W = K^{-1}; the variables W_ij will designate the actual synaptic strengths in the feedback network introduced in the next section. Then in terms of W the algorithm 2.5 becomes

τ dW/dt = R · (W^{-1})^T − W   (2.8)
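As a numerical sanity check of this derivation, the discrete update 2.5 can be iterated on a random autocorrelator while monitoring the measure 2.2. This is an illustrative sketch; the dimension, step size, and random seed are arbitrary choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
# Random symmetric positive-definite autocorrelator R, scaled to modest eigenvalues.
A = rng.normal(size=(n, n))
R = A @ A.T / n + 0.1 * np.eye(n)

def E(K):
    # Entropy measure 2.2: Tr(K R K^T) - log det(K^T K)
    return np.trace(K @ R @ K.T) - np.linalg.slogdet(K.T @ K)[1]

K = np.eye(n)
energies = [E(K)]
for _ in range(300):
    M = K @ R @ K.T - np.eye(n)   # deviation from the decorrelating solution 2.3
    K = K - 0.05 * M @ K          # update 2.5, with the 1/2 absorbed into a step size
    energies.append(E(K))

print("E decreased at every step:",
      all(e1 <= e0 + 1e-9 for e0, e1 in zip(energies, energies[1:])))
print("final residual |K R K^T - 1| =", np.linalg.norm(K @ R @ K.T - np.eye(n)))
```

The measure decreases monotonically and K converges to a solution of 2.3, in line with the Lyapunov argument above.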
where τ is a time constant setting the update rate. Although W^{-1} does appear in 2.8, it appears in the combination R · (W^{-1})^T, which as will be discussed in the next section is equal to (S_i O_j), and thus in the network implementation no nonlocal computations will be needed. The algorithm 2.8 can be recognized as the ensemble averaged form of the Goodall (1960) algorithm. Since direct minimization of E{K} in 2.2 gives the global minimum 2.3 with no local minima, our demonstration that 2.8 or 2.5 always reduces E{K} is sufficient to prove convergence. However, for completeness we shall also give the old proof of convergence (Goodall 1960) that does not require a minimization measure. One reason we give this proof here is that it leads us to rewrite the update algorithm 2.8 in yet another form that will be useful in our later discussion of symmetry fixing and locality. To start the proof we need, in addition to 2.8, the transpose equation

τ dW^T/dt = W^{-1} · R − W^T   (2.9)

using R = R^T. By multiplying 2.8 from the right by W^T and 2.9 from the left by W and then adding the resulting equations, we arrive at the integrable differential equation

τ d(W · W^T)/dt = 2 (R − W · W^T)   (2.10)

which has the solution

W · W^T = R + C e^{−2t/τ}   (2.11)

where C is a constant matrix determined by initial conditions. In the limit where t/τ becomes large, W · W^T = R, which is just the minimum E{K} solution 2.3 since W = K^{-1}.

3 Network Implementation
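The exponential relaxation of W · W^T toward R is easy to verify by integrating 2.8 with a small Euler step. This is a numerical sketch; τ, the step size, and the random initial W are arbitrary choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.normal(size=(n, n))
R = A @ A.T / n + 0.5 * np.eye(n)              # autocorrelator

tau, h = 1.0, 0.01
W = np.eye(n) + 0.2 * rng.normal(size=(n, n))  # generic invertible starting map
deviations = []                                 # |W W^T - R|, i.e. |C| e^{-2t/tau} by 2.11
for _ in range(600):
    W = W + (h / tau) * (R @ np.linalg.inv(W).T - W)   # Euler step on 2.8
    deviations.append(np.linalg.norm(W @ W.T - R))

# The deviation shrinks steadily, by roughly e^{-2} per unit of t/tau.
print("deviation at t = 1, 2, 6:",
      [round(deviations[k], 5) for k in (99, 199, 599)])
```

Because the exact flow for W · W^T is linear, the deviation decays monotonically to zero regardless of the starting W, as 2.11 predicts.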
In this section we discuss a network implementation of the above algorithm. As shown in Figure 1 the network has one nontrivial layer, with feedforward connections directly from the input S_i and lateral feedback connections among the neurons. It is assumed that the input S_i is directly connected only to the ith neuron, with link strength unity. The plastic links in this network are the lateral feedback links that connect the output of the jth neuron back to the input of the ith neuron with link strength W_ij. The neurons are assumed to be linear, thus the dynamics of this network can be written as

T dO_i/dt = S_i − Σ_j W_ij O_j   (3.1)
Figure 1: The architecture of the network used for decorrelation. It is assumed that the ith neuron receives direct input with synaptic weight unity from S_i and feedback input from the outputs of all neurons with weights W_ij. Only the feedback links W_ij are updated during learning.

where O_i is the output of the ith neuron, S_j is its feedforward input, and T is a time constant. Below we shall assume that T ≪ τ for the update algorithm 2.8, so we can safely approximate the system by its equilibrium solution, dO_i/dt = 0:

S_i = Σ_j W_ij O_j   (3.2)
This shows that the matrix W_ij can be identified with the inverse of the retinal kernel K_ij introduced above, so (W^{-1})_ij corresponds to the ith ganglion cell's receptive field. In terms of the network variables S_i, O_j, W_ij, the algorithm 2.8 can be written as

τ dW_ij/dt = (S_i O_j) − W_ij   (3.3)
where the brackets denote an ensemble average. To see that this is the same as 2.8, note that (S_i O_j) = Σ_k (S_i S_k)(W^{-1})_jk and (S_i S_j) = R_ij. As long as τ is much greater than the characteristic time scale of the ensemble, we can remove the averaging brackets in 3.3 and finally arrive at the algorithm that requires no a priori knowledge of any ensemble properties (no prior knowledge of R):

τ dW_ij/dt = S_i O_j − W_ij   (3.4)

Another significant fact about 3.4 is that it is biologically plausible in the sense that all of the data S_i, O_j, W_ij needed for the ith neuron to update its synaptic links W_ij is available locally to that neuron (see Fig. 1). In Section 6 we simulate 3.4 for an ensemble of inputs with the same autocorrelator as natural scenes. But before we can do that we need to discuss the important technical issue of the nonuniqueness of the decorrelating solutions, and the related problem of nonlocalized receptive fields.

4 Finding Localized Receptive Fields
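The sample-by-sample rule 3.4 can be sketched directly: draw inputs, compute the equilibrium outputs from 3.2, and nudge each weight toward the product S_i O_j. This is an illustrative sketch with arbitrary sizes, learning rates, and seed (and it uses a matrix solve in place of the network's relaxation to equilibrium):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4
A = rng.normal(size=(n, n))
R = A @ A.T / n + 0.5 * np.eye(n)      # input autocorrelator (never given to the network)
L = np.linalg.cholesky(R)              # used only to draw samples with <S S^T> = R

W = np.eye(n)
for step in range(30000):
    S = L @ rng.normal(size=n)         # input sample
    O = np.linalg.solve(W, S)          # equilibrium outputs 3.2: S = W O
    lr = 0.01 if step < 10000 else 0.002
    W += lr * (np.outer(S, O) - W)     # local rule 3.4: tau dW_ij/dt = S_i O_j - W_ij

# After learning, the outputs are pairwise decorrelated with unit variance.
test_inputs = L @ rng.normal(size=(n, 20000))
out = np.linalg.solve(W, test_inputs)
cov = out @ out.T / test_inputs.shape[1]
print("output covariance:\n", np.round(cov, 2))
```

Only S_i, O_j, and W_ij itself appear in the weight update, which is the locality property claimed above.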
We would like next to apply the learning algorithm 3.4 to generate linear receptive fields for an ensemble of "natural" input images. However, if one naively employs this learning algorithm, the receptive fields very often turn out to be nonlocalized, and thus do not resemble those of retinal ganglion cells. The problem is that the algorithm 3.4 or 2.8, though guaranteed to converge to a pairwise decorrelating solution, W · W^T = R, does not find a unique solution W, and most of the solutions are not localized. The source of the nonuniqueness is that W · W^T possesses a large symmetry, since any transform W → W · U of a solution W by any orthogonal matrix U (U · U^T = 1) is also a solution. This means that there is a whole class of acceptable solutions parameterized by U. However, from a biological point of view, most of these solutions are unacceptable since they correspond to nonlocalized receptive fields. So the problem is to eliminate the extra symmetry U in a way that guarantees localization. In other words, we wish to remove the extra degrees of freedom in W due to this symmetry by constraining the form of W, while preserving its decorrelating property: W · W^T = R. One type of constraint that does this is
W = W^T   (4.1)
since there is no longer any freedom to multiply W by U without violating this condition. That this condition removes all the extra freedom follows from the fact that the number of independent components of the matrix
W that are eliminated is N(N − 1)/2, which is precisely the number of independent components in U. The reason for our choice of constraint 4.1 is that, for all autocorrelators R that we consider, 4.1 leads to localized, translation invariant receptive fields. To see this note that 4.1 implies that W satisfies the more restricted equation W · W = R, which has (up to sign) a unique solution since the U symmetry has been eliminated. This means that if R is translation invariant so is W.² Furthermore, translation invariance of W allows us to go to frequency space and write W(f)² = R(f). Recall then that W(f) is the inverse of the receptive field kernel K(f), so K(f) = 1/√R(f). This means that K(f) will be localized so long as R(f) has the property that it is a sufficiently smooth function of frequency f. One can show that for the receptive field to be localized to an approximate size D, the smoothness criterion for K(f) [and thus implicitly for R(f)] is [∂K(f)/∂f]/[D K(f)] ≪ 1. This smoothness condition, for the appropriate D, is satisfied by the autocorrelators R(f) that we use here, which are based on the properties of natural scenes. Now that we know what condition 4.1 we wish the solution W to satisfy once it converges, the next step is to find a way to guarantee that the learning algorithm 3.4 will actually converge to the specific W satisfying 4.1, rather than to one of the many other W related to it through multiplication by some U.
In the next section we introduce a procedure for doing so that is well known to physicists as "gauge fixing" and that we call here "symmetry fixing." The discussion in that section is somewhat technical, so for the reader who is not interested in the details we state the conclusion here so that the next section can be skipped without loss of continuity: To guarantee that the system converges to a solution that satisfies 4.1 (i.e., a localized solution), it turns out that it is sufficient to start the algorithm at time zero with the initial condition W(t = 0) = 1 (this is sufficient but may not be necessary). While this is a very simple boundary condition, it is not obvious that it alone ensures convergence to configurations satisfying both W · W^T = R and W = W^T, which is what we prove in the next section.
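The contrast between the symmetric decorrelating solution and an arbitrary one can be sketched numerically: for a circulant autocorrelator with a smooth spectrum, the symmetric square root W = R^{1/2} yields a translation invariant, localized kernel K = W^{-1}, while another decorrelating factor (a Cholesky factor is used here for comparison) does not. This is an illustrative sketch; the toy spectrum (a periodic Gaussian correlation plus a white floor) is an assumed stand-in for the natural-scene spectra used in the paper:

```python
import numpy as np

n = 32
d = np.minimum(np.arange(n), n - np.arange(n))    # periodic distances
row = np.exp(-d**2 / 8.0) + 0.5 * (d == 0)        # smooth correlations + white floor
R = np.array([np.roll(row, i) for i in range(n)])  # circulant autocorrelator

# Symmetric decorrelating solution: W = R^{1/2}, so W W^T = R and W = W^T.
vals, vecs = np.linalg.eigh(R)
W = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
K = np.linalg.inv(W)        # receptive-field kernel; in frequency space K(f) = 1/sqrt(R(f))

# A different decorrelating solution (Cholesky: L L^T = R) for comparison.
L = np.linalg.cholesky(R)

print("W symmetric:             ", np.allclose(W, W.T))
print("K translation invariant: ", np.allclose(K[1], np.roll(K[0], 1)))
print("L translation invariant: ", np.allclose(L[1], np.roll(L[0], 1)))
print("K center vs. far ratio:  ", abs(K[0, 0] / K[0, n // 2]))
```

Because R(f) here is a smooth, bounded-away-from-zero function of frequency, every row of K is the same localized filter shifted along the diagonal, while the Cholesky rows vary from neuron to neuron.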
5 Symmetry Fixing
Our approach here is to ask how the algorithm 3.4 can in general be forced to converge to a solution that satisfies some symmetry fixing constraint, say 4.1, which we argued in the last section ensures locality.³

²This follows since the solution to W · W = R is unique. Therefore if a translation invariant solution is found it must be the only solution.
³An alternative way to fix the symmetry is to include in the developmental algorithm terms that attempt to produce statistical independence beyond pairwise decorrelation (there is no proof yet that any developmental algorithm with higher order terms converges). This is assuming, of course, that higher order correlations break the U symmetry, but this is often the case (see, e.g., Hopfield 1991).

To
accomplish this we study both the possibility of using special initial conditions [such as W(t = 0) = 1] as well as the possibility of modifying the algorithm itself in a way that does not spoil its decorrelating or convergence properties. It will turn out that for the special case of the condition W = W^T, no modification of the algorithm is needed if one uses the initial condition W(t = 0) = 1. However, to prove this requires that we look at the general symmetry fixing problem, where it will be necessary to modify the algorithm. One reason, also, that we explore the more general symmetry fixing problem is that the U symmetry is ubiquitous in the literature on decorrelating algorithms. Therefore, it is important to learn tools for handling this symmetry. Generally speaking, symmetry fixing is a procedure that introduces extra control over the dynamics for W in a way that automatically forces the system to converge to the W satisfying the desired constraint. The extra control, as we explain next, is introduced by modifying 2.8 in a way that does not interfere with the convergence proof. The key to the modification of the dynamics comes from recognizing that not only does the solution W · W^T = R possess the U symmetry, but also the dynamic equation 2.10, which is the fundamental equation ensuring convergence to the decorrelation solution, is invariant under the transformation W → W · U(t) for any time-dependent orthogonal matrix U(t). This means that multiplication of W(t) by a time-dependent orthogonal transformation does not interfere with the convergence proof. However, it does modify the dynamics in 2.8 since that equation is not invariant under this transformation. In fact under W → W · U(t), 2.8 transforms to
τ dW/dt = R · (W^{-1})^T − W + W · (dU/dt) · U^T   (5.1)
To see that the last term indeed drops out from the dynamic equation 2.10 for W · W^T, we need only note that U · (dU^T/dt) = −(dU/dt) · U^T, valid since U · U^T = 1. The next step is to show how the freedom to choose U(t) can be used to control the dynamics such that the system will converge to a W with a particular property, say W^T = W. One might attempt to satisfy this constraint 4.1 on the solution W by starting with initial conditions that satisfy it. However, the original dynamic equation 2.8 very quickly would evolve W(t) to configurations that violate 4.1. On the other hand, the modified equation 5.1 gives us the extra freedom to pick U(t) such that the property 4.1 is maintained at all time t. For completeness we explicitly give the symmetry-fixing term in 5.1 that does maintain 4.1 for all time t, so long as W^T(0) = W(0) at time t = 0. We give it in a basis where W is diagonal with eigenvalues λ_i: (5.2)
Joseph J. Atick and A. Norman Redlich
where the brackets denote the commutator, defined for any two matrices A and B as [A, B] = A · B - B · A. One could try to simulate the dynamics in 5.1 with the explicit expression 5.2 included; however, this term appears too complex to be biologically plausible. However, if we choose the more restrictive initial condition for W(0) to satisfy [R, W^-1(0)] = 0, in addition to W(0) = W^T(0), then the necessary symmetry-fixing term 5.2 vanishes on the first iteration, and can be shown to remain zero thereafter. So by going through the symmetry-fixing exercise we have identified a set of special initial conditions on W that do guarantee that 4.1 is maintained at all times, and moreover do so without modifying the original equation 2.8. One choice of initial conditions that very simply satisfies both [R, W^-1] = 0 and W = W^T is W(0) = 1, which is the condition we use in our simulations in the next section. Without the above analysis we would not have known a priori that this initial condition is sufficient to ensure that the solution the system converges to will satisfy W = W^T. This particular initial condition may be biologically plausible, since it means that initially the neurons have no lateral links and that these links are developed as needed. Also, we have tested the stability of this initial condition by starting the runs from configurations where the lateral links are nonvanishing but small, and we find that small nonvanishing links do not disturb the convergence to the unique solution.
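As a numerical sanity check on this claim, one can integrate a discrete-time (Euler) version of 2.8 from the initial condition W(0) = 1 with a symmetric positive-definite R and verify that W stays symmetric while W · W^T converges to R. The matrix size, step size, and iteration count below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
A = rng.normal(size=(n, n))
R = A @ A.T / n + 0.5 * np.eye(n)   # symmetric positive-definite "autocorrelator"

W = np.eye(n)                        # initial condition W(0) = 1
dt = 0.01                            # Euler step (plays the role of dt/tau)
for _ in range(20000):
    W = W + dt * (R @ np.linalg.inv(W).T - W)

assert np.allclose(W, W.T, atol=1e-6)      # symmetry W = W^T is preserved
assert np.allclose(W @ W.T, R, atol=1e-5)  # decorrelation solution: W W^T -> R
```

Starting instead from a generic (asymmetric) W(0) still converges to some solution of W · W^T = R, but not to the symmetric one, which is the U-symmetry ambiguity the text describes.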
6 Simulations
To test the learning algorithm 3.4 we first applied it to a one-dimensional (1D) ensemble of inputs S_i, i = 1, ..., n with n = 64. For simulation purposes, we imposed periodic boundary conditions on the 1D space and generated an ensemble of S_i with an autocorrelator R_ij whose Fourier transform (power spectrum) was R(f) ∝ 1/|f|^4 but that had no higher order correlations. This particular R(f) in one dimension was an arbitrary choice that happened to give clear results. By itself, however, this ensemble is not realistic, since unlike real sensory signals it is noise free. So we added to the signal S_i noise with flat power spectrum |N(f)|^2 = N^2. This type of noise signal N_i is already spatially decorrelated, having autocorrelator ⟨N_i N_j⟩ ∝ δ_ij. Adding the random noise signal N_i also happens to serve a significant function by increasing the stability of the algorithm 3.4, although this might seem counterintuitive. The source of the instability is that W^-1 appearing in 2.8 (even though we implement 3.4) can eventually blow up if there are any very small eigenvalue modes in R, since ultimately (W · W^T)^-1 → R^-1 according to 2.11. This small eigenvalue problem can be accentuated during the learning process while W · W^T is only approximately equal to R, so during learning W may have even smaller
Algorithm for Sensory Receptive Field Development
eigenvalues than it does once it converges. Adding noise with a flat spectrum eliminates this small eigenvalue problem by adding a constant to all the eigenvalues: in frequency space R(f) → R(f) + N^2, so the eigenvalues can never get too small. In practice any nonvanishing noise N^2 assures stabilization. Thus the algorithm 3.4 applied to any realistic problem (which always has nonvanishing noise) will continue to be convergent. In our one-dimensional simulations we chose a signal to noise ratio of four. The parameter τ in 3.4 was chosen to be much larger than the characteristic time scale of the ensemble so that 3.4 could approximate the convergent algorithm 3.3. A general lower bound on τ is not possible, since it is ensemble dependent and hence requires experimentation. We find that smooth convergence does require a relatively large τ compared to the size of the terms on the right-hand side of 3.4. In our one-dimensional simulations we chose τ = 5000. Finally, we used the initial condition W_ij = δ_ij, which as discussed in Section 3 is sufficient to eliminate the U symmetry. This turns out to produce spatially local receptive fields W^-1, as shown in Figure 2, which exhibits the converged solution for the 32nd neuron, W^-1_32j, after 20,000 iterations. Having demonstrated the viability of 3.4 through a one-dimensional example, the next step is to apply 3.4 to a two-dimensional ensemble with statistical properties close to those of a natural visual environment. Field (1987) argued, based on some experimentation with "natural" scenes, that natural environments have an approximately scale invariant spatial autocorrelator, which is equivalent to having a spatial power spectrum proportional to |f|^-2 for two-dimensional frequencies f.
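Before moving to 2D, the converged 1D solution can be sketched directly in frequency space: with a signal spectrum ∝ 1/|f|^4 plus a flat noise floor N^2, the converged decorrelating solution is W(f) = √R(f), and the receptive field W^-1(f) = 1/√R(f) is spatially localized. The noise level and DC regularization below are illustrative assumptions, not the paper's exact parameters:

```python
import numpy as np

n = 64
f = np.abs(np.fft.fftfreq(n) * n)    # integer frequencies on the periodic 1D lattice
f[0] = 1.0                           # regularize the DC mode (illustrative choice)
N2 = 0.01                            # flat noise power (hypothetical S/N choice)
R_f = 1.0 / f**4 + N2                # power spectrum of signal + noise

# converged solution W(f) = sqrt(R(f)); the receptive field is W^-1
rf = np.real(np.fft.ifft(1.0 / np.sqrt(R_f)))
rf = np.fft.fftshift(rf)             # center the kernel on the neuron's position

assert np.argmax(rf) == n // 2       # receptive field peaks at the neuron's position
```

Because 1/√R(f) rises with |f| only up to the noise floor 1/√N2, the resulting spatial kernel is a localized center-surround profile rather than a pure high-pass filter.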
We have shown previously (Atick and Redlich 1992) that the retinal ganglion cell kernel W^-1 acts to decorrelate the environmental signal, at least at low frequencies where the signal to noise is high, which in frequency space means W^-1(f) ∝ |f|, so that |W(f)|^2 = R(f). Here, the goal of the two-dimensional simulation is to demonstrate that this decorrelating kernel can be learned by a network implementing the developmental rule 3.4. In two dimensions (2D) we generated an ensemble with power spectrum ∝ 1/|f|^2 and without higher order statistics. We again added random noise for S/N = 4, chose the initial condition W = 1, and found that τ = 20,000 was sufficiently large to ensure smooth convergence. There was one major difference, however, between our one- and two-dimensional simulations, necessitated by the need to reduce computation time in the two-dimensional case. In 2D it was computationally necessary to assume translation invariance of the solution W_ij, meaning assuming in advance that W_ij will take the form W_ij = W(|i - j|) after convergence [note that the index i in 2D now denotes the two-dimensional vector i = (i_x, i_y)]. This is a valid assumption since, as discussed in Sections 4 and 5, the initial condition W = 1 implies W = W^T, which in turn implies translation invariance of W, as long as the ensemble is chosen to be translation invariant. Restricting W to
Figure 2: The one-dimensional receptive field, W^-1_32j, for the 32nd neuron after the algorithm converges (j running along the horizontal axis, with 64 the total number of neurons). The receptive fields of other neurons are very similar. The dashed line is the autocorrelator ⟨S_i S_32⟩ averaged over the training data, while the solid curve is (W · W^T)_j32, which is very close to the actual autocorrelator. The training was done with 20,000 iterations, τ = 5000, and S/N = 4, starting with the initial condition W_ij(t = 0) = δ_ij.

Figure 3: Facing page. Typical two-dimensional images used in training the network in 2D. Image B differs from image A only by the addition of a small amount of noise. These images have the property that their power spectrum is identical to that of natural scenes, namely 1/|f|^2. We have trained networks using these images instead of actual scenes for convenience only. The algorithm should converge to similar solutions if trained on actual natural scenes as long as the power spectrum is the same.
be translationally invariant in advance is computationally valuable, since it reduces the number of computations from order n^2 to n, where n is the number of input signals (number of photoreceptors), which for our simulation was n = 16 × 16. We should emphasize, though, that this restriction was purely for simulation purposes; in an actual neural implementation of 3.4 there would be no need for it. The result of the two-dimensional simulation following convergence, which took about 200,000 steps, is shown in Figure 4. It has the generic center-surround property of a ganglion cell, but does not correspond exactly to a ganglion cell receptive field. That is because it is a purely decorrelating kernel, whereas the true ganglion cell kernel both decorrelates and filters noise, as shown in Atick and Redlich (1992). In that paper we demonstrated that the total retinal kernel can be obtained by first low-pass filtering the noisy input signal and then decorrelating the result. Here we have demonstrated that 3.4 can perform the decorrelation step. Therefore, if one first low-pass filters in exactly the same way as in Atick and Redlich (1992) (where we were careful to include the appropriate transmission noise following the low-pass stage) and then applies 3.4, one must arrive at the same realistic ganglion cell kernels found in that paper, since this effectively reproduces the steps outlined there. We also expect that applying 3.4 to natural scenes, since they have the same spectrum as our simulation scenes, would likewise produce results close to those shown here. In a two-stage process in which noise is filtered and then the signal is decorrelated, it is also possible to derive the noise filter separately using a different learning algorithm. For example, in Atick and Redlich (1991) we gave a convergent developmental rule for learning least mean square noise smoothing.
(Such noise filtering algorithms differ from decorrelation algorithms such as 3.4 in that they require sufficient supervision to distinguish signal from noise.) Therefore, both noise filtering and decorrelation can be achieved developmentally in a two-step process in which first noise filtering and then decorrelation are learned. There are also information theoretic formalisms in which both noise filtering and decorrelation can be achieved through minimizing (or maximizing) a single information theoretic quantity (Atick and Redlich 1990; Linsker 1988). If a developmental algorithm could be found to perform this minimization (or maximization) then both goals, noise filtering and decorrelation, could be achieved through one learning stage. Linsker (1991) has made some progress in this direction, though his learning algorithm still requires a two-phase process. We recently became aware of the Ph.D. thesis of Mark Plumbley (Engineering Dept., Cambridge University), where a proof of convergence for another decorrelating algorithm is presented, using a similar approach.
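A frequency-space sketch of such a two-stage kernel, with illustrative parameter choices that are assumptions rather than the paper's values: stage one is a least-mean-square (Wiener) low-pass M(f) = R(f)/(R(f) + N^2) that suppresses the noisy high frequencies; stage two decorrelates the filtered signal, including a small transmission noise floor after the low-pass. The product of the two stages is a bandpass kernel, qualitatively like a ganglion cell:

```python
import numpy as np

n = 64
f = np.abs(np.fft.fftfreq(n) * n)
f[0] = 1.0
S2 = 1.0 / f**4              # signal power spectrum (illustrative)
N2 = 0.003                   # flat input noise power (hypothetical)
eps = 1e-3                   # transmission noise after the low-pass stage (hypothetical)

M = S2 / (S2 + N2)                       # stage 1: Wiener low-pass filter
P = M**2 * (S2 + N2) + eps               # power spectrum reaching stage 2
K = M / np.sqrt(P)                       # stage 2 decorrelates; K is the total kernel

pos = np.arange(1, n // 2)               # positive frequencies
peak = pos[np.argmax(K[pos])]
assert 1 < peak < n // 2 - 1             # total kernel is bandpass: peak at intermediate f
```

Without the transmission noise eps the two stages would cancel into pure whitening of the input; it is the noise after the low-pass that makes the combined kernel roll off at high frequencies.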
Figure 4: The receptive field found by the algorithm for the central neuron in a two-dimensional 16 × 16 array of neurons. The receptive fields of other neurons are very similar. The training was done with 200,000 iterations, τ = 20,000, S/N = 4, and W = 1, using images like those shown in Figure 3.
Acknowledgments

We would like to thank K. Miller for useful discussions and the Seaver Institute for its support.

References

Atick, J. J., and Redlich, A. N. 1990. Towards a theory of early visual processing. Neural Comp. 2, 308-320.
Atick, J. J., and Redlich, A. N. 1991. Predicting ganglion and simple cell receptive field organizations. Int. J. Neural Syst. 1, 305-315.
Atick, J. J., and Redlich, A. N. 1992. What does the retina know about natural scenes? Neural Comp. 4, 196-210.
Barlow, H. B. 1989. Unsupervised learning. Neural Comp. 1, 295-311.
Barlow, H. B., and Foldiak, P. 1989. The Computing Neuron. Addison-Wesley, New York.
Barlow, H. B., Kaushal, T. P., and Mitchison, G. J. 1989. Finding minimum entropy codes. Neural Comp. 1, 412-423.
Field, D. J. 1987. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A 4, 2379-2394.
Foldiak, P. 1989. Adaptive network for optimal linear feature extraction. Proc. IEEE/INNS Int. Joint Conf. Neural Networks, Washington, DC, Vol. 1, 401-405.
Goodall, M. C. 1960. Performance of a stochastic net. Nature (London) 185, 557-558.
Hopfield, J. J. 1991. Olfactory computation and object perception. Proc. Natl. Acad. Sci. U.S.A. 88, 6462-6466.
Kohonen, T., and Oja, E. 1976. Fast adaptive formation of orthogonalizing filters and associative memory in recurrent networks of neuron-like elements. Biol. Cybern. 21, 85-95.
Linsker, R. 1986. From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proc. Natl. Acad. Sci. U.S.A. 83, 508-512.
Linsker, R. 1988. Self-organization in a perceptual network. Computer 21 (March), 105-117.
Linsker, R. 1991. Talk at the 1991 meeting of the Society for Neuroscience.
Oja, E. 1982. A simplified neuron model as a principal component analyzer. J. Math. Biol. 15, 267-273.
Rubner, J., and Schulten, K. 1990. Development of feature detectors by self-organization. Biol. Cybern. 62, 193-199.
Sanger, T. D. 1989. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks 2, 459-473.

Received 21 January 1992; accepted 12 May 1992.
Communicated by Dana Ballard
Three-Dimensional Object Recognition Using an Unsupervised BCM Network: The Usefulness of Distinguishing Features

Nathan Intrator
Joshua I. Gold
Brown University, Providence, RI 02912 USA

We propose an object recognition scheme based on a method for feature extraction from gray level images that corresponds to recent statistical theory, called projection pursuit, and is derived from a biologically motivated feature extracting neuron. To evaluate the performance of this method we use a set of very detailed psychophysical three-dimensional object recognition experiments (Bülthoff and Edelman 1992).

1 Introduction

A system that performs recognition of three-dimensional (3D) objects in visual space must transform a complex pattern of visual inputs into an appropriate categorization. Such recognition is possible, for example, by template matching once the object and its templates are brought into register (Ullman 1989). Other similar schemes (Lowe 1986; Thompson and Mundy 1987) base the recognition on viewpoint consistency, which relates projected locations of key features of a model to its 3D structure given a hypothesized viewpoint. The regularization network or HyperBF interpolation scheme (Poggio and Edelman 1990; Poggio and Girosi 1990) represents 3D objects by sets of two-dimensional (2D) views using vectors of key-feature locations and regards generalization from familiar to novel views as a problem of nonlinear hypersurface interpolation in the space of all possible views. All these methods rely on the ability to find key features in the objects and, in some cases, to solve the correspondence problem between them.¹ However, sometimes these tasks can be as difficult as the recognition itself. In this paper, we propose an object recognition method that does not rely on finding such key features a priori.
Instead, a transformation is sought that reduces the pixel image representations into a low-dimensional space from which nonlinear hypersurface interpolation can

¹Edelman and Weinshall (1991) used the vertices without solving the correspondence problem between them.
Neural Computation 5, 61-74 (1993)
© 1993 Massachusetts Institute of Technology
a priori an ordered list of vertices from the image and using a generalized radial basis function (GRBF) classification scheme (Moody and Darken 1989; Poggio and Girosi 1990). This method classified lists of vertices based on their orientation within a vector space defined by the vertex sets of known objects; it achieved close to human performance in generalizing to novel views of the wires. The performance reflected a strong focus on the classification technique, and assumed a deterministic, a priori feature extraction. We, on the other hand, want to concentrate on the examination of the properties of our proposed feature extraction method, and therefore in this study have chosen to use a classical, well-known classifier based on the k-nearest-neighbor rule⁵ (see, for example, Duda and Hart 1973). In addition to the type of classifier used, the recognition paradigm with which the system is tested is a vital component in determining the usefulness of the features extracted. In the following sections we present an application of the BCM model to a set of specific 3D object recognition problems. The experiments chosen fulfill two important criteria: (1) they test the model's abilities to both recognize and generalize across a wide range of difficulties, and (2) these same studies have been used to test the abilities of not only computational models but also human subjects; the psychophysical results in fact serve as benchmarks for this study.

3.1 Previous Studies. Bülthoff and Edelman (1992) developed and used wire-like objects in their experiments, in an effort to simplify the problem for the feature extractor by providing little or no occlusion of the key features from any viewpoint. The wires consisted of seven connected segments, each pointed in a random direction but with its vertices distributed normally around the origin. Each experiment consisted of two phases, training and testing.
In the training phase subjects were shown the target object from two standard views, located 75° apart along the equator of the viewing sphere. The target oscillated around each of the two standard orientations with an amplitude of ±15° about a fixed vertical axis, with views spaced at 3° increments (see Fig. 1). Test views were located either along the equator, on the minor arc bounded by the two standard views (INTER condition) or on the corresponding major arc (EXTRA condition), or on the meridian passing through one of the standard views (ORTHO condition). Testing was conducted according to a two-alternative forced choice (2AFC) paradigm, in which subjects were asked to indicate whether the displayed image constituted a view of the target object shown during the preceding training session. Test images were either unfamiliar views of the training object or random views of a distractor (one of a distinct set of objects generated by the same procedure).

⁵Very similar classification results were obtained using a backpropagation classifier. In a forthcoming article, performance of backpropagation and radial basis function (RBF) classifiers will be compared using features extracted by the above feature extraction method.
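The k-nearest-neighbor rule used for classification is simple enough to state in a few lines. The following is a generic sketch (not the authors' code), labeling each test feature vector by majority vote among its k closest training vectors:

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=3):
    """Classic k-NN rule (Duda and Hart 1973): assign each test point the
    majority label among its k nearest training points (Euclidean metric)."""
    preds = []
    for x in test_x:
        d = np.linalg.norm(train_x - x, axis=1)      # distances to all training points
        nearest = train_y[np.argsort(d)[:k]]         # labels of the k closest
        preds.append(np.bincount(nearest).argmax())  # majority vote
    return np.array(preds)

# toy check: two well-separated clusters standing in for two wires' feature vectors
rng = np.random.default_rng(0)
train_x = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
                     rng.normal(1.0, 0.1, size=(20, 2))])
train_y = np.array([0] * 20 + [1] * 20)
assert knn_predict(train_x, train_y,
                   np.array([[0.0, 0.0], [1.0, 1.0]])).tolist() == [0, 1]
```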
Figure 1: The training and testing experimental paradigm.

A number of interesting characteristics of human visual object recognition abilities emerged from the psychophysical experiments. Generalization over orientations lying between two sets of known views (the INTER condition) resulted, on average, in significantly fewer errors than the other two extrapolation conditions. In addition, error rates increased steadily as the testing views moved farther away from the learned views, until recognition was near chance levels at large displacements. Finally, there were indications of a "horizontal bias," such that error rates were lower when generalization was required along the horizontal, as opposed to the vertical, plane.
3.2 Experimental Paradigm. In the first part of the study, the network was tested on a 63 by 63 array of 8-bit gray-scale values with a paradigm nearly identical to the one used in the psychophysical investigation (Edelman and Bülthoff 1991). The procedure was modified slightly in that training was performed with two wires, since the k-NN classifier would yield meaningless results if trained on only a single wire. In the second part of the study, simple yes/no recognition was upgraded to a more difficult classification task involving six separate wires. The modification was necessary to fully test the BCM model's ability to extract the most salient rotation-invariant features from the images. Specifically, since BCM neurons explicitly search for differentiating features (due to the search for multimodality in the projected distribution),
many cases involving only two distinct sets of inputs can be solved with "features" corresponding to prototypical views of each wire. In these cases, the two sets of wire-views, corresponding to the two wires, would form two distinct clusters in feature space. However, such differentiation would be much more difficult with a larger number of wires, and therefore the BCM network neurons would be forced to find projections that correspond to individual, rotation-invariant features, not prototypical views of individual wires. In addition, the model was modified in an attempt to account for the asymmetric psychophysical results. In the psychophysical experiments, the horizontal bias was found when humans were given the exact same paradigm as described above, except that the objects were rotated 90° so that the training axis was aligned vertically, not horizontally. One possible explanation of such asymmetry is increased resolution at the object representation level, namely, due to the fact that behaviorally, humans spend more time rotating around a vertical axis (i.e., rotation in a horizontal plane). This is experimentally equivalent to having more patterns rotated in a horizontal than in a vertical plane. This possibility has been eliminated in the careful psychophysical experiment performed by Edelman and Bülthoff (1991), in which subjects are provided identical experience with horizontal and vertical training. The continued existence of the bias under such conditions implicates an internal mechanism. We hypothesized greater a priori resolution in the internal representation along the horizontal plane. More specifically, we set the ratio between the resolution in the horizontal plane and that in the vertical plane (the aspect ratio) to be 2/1 for "normal" training in the horizontal plane; conversely, training in the vertical plane was, from the point of view of the network, equivalent to setting the aspect ratio to be 1/2.
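To illustrate the kind of selectivity described above, here is a minimal single-neuron sketch of a BCM-style rule (Bienenstock et al. 1982); the initial weights, learning rate, and running-average threshold below are illustrative assumptions, not the network actually used in this paper. Presented with two orthogonal input patterns, the neuron's projection becomes selective: a strong response to one pattern and a near-zero response to the other, i.e., a bimodal projected distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
patterns = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # two orthogonal stimuli

m = np.array([0.3, 0.1])            # small asymmetric initial weights (illustrative)
theta = 0.0                         # sliding modification threshold
eta, tau = 0.005, 100.0             # learning rate and threshold time constant

for _ in range(20000):
    x = patterns[rng.integers(2)]
    c = m @ x                        # neuron's response (projection of the input)
    theta += (c**2 - theta) / tau    # theta tracks a running average of c^2
    m += eta * c * (c - theta) * x   # BCM update: potentiate if c > theta, depress if below

r = sorted(abs(m @ p) for p in patterns)
assert r[1] > 1.5 and r[0] < 0.2     # selective: strong response to one pattern, ~0 to the other
```

With many such neurons and decorrelating interactions between them, as in Intrator's projection-pursuit formulation, different neurons find different discriminating projections, which is the behavior exploited in the six-wire task.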
Prediction of simulation performance due to this asymmetrical resolution is not straightforward, since there are two contradictory effects. On the one hand, decreased resolution in the vertical plane means reduced disparity from rotations along that plane and therefore possibly better performance. On the other hand, there may also be improved performance in the horizontal axis, since higher resolution will emphasize features that are rotation invariant along that direction.
4 Results
The six wires used in the experiments are depicted in Figure 2. Different views of three of the wires are depicted in Figure 3. When only two wires were used (experiment one) the features extracted correspond almost exclusively to a single view of a whole image of one of the wires.⁶

⁶There is, in fact, limited evidence for visual field elongation in the horizontal plane (Hughes 1977).
Figure 2: The six wires from a single view.
Figure 3: Different views (15° apart) of a single wire; top-to-bottom are INTER, EXTRA, and ORTHO.

In contrast, when the task was recognition of six wires the extracted features emphasized small patches of several images or views, namely, areas that either remain relatively invariant under the rotation performed during training or represent a major differentiating characteristic of a specific wire (Fig. 4). A typical result is a set of weights that may correspond to a single wire but emphasizes small patches of the object and selectively inhibits areas which correspond to invariant locations of adjacent wires. For example, the top left image of Figure 4 largely represents object number 5 in Figure 2, with additional inhibition from other objects, while the top right image or the bottom second-from-the-right image exhibits weights related to several images/views. Classification results demonstrate the usefulness of the extracted features: generalization in the INTER orientations resulted in consistently
Figure 4: Rotation invariant features for tube-like objects extracted using a network of seven BCM neurons trained on six tube-like objects. White areas represent strong synaptic weights; black areas represent negative synaptic weights (inhibition).
low error rates of around 15% (the chance error rate in this six-wire experiment being 83.3%), which indicates that the features extracted by the BCM network could generalize well to those new views.⁷ Furthermore, the results are comparable to those obtained in the psychophysical experiments. First, INTER recognition resulted, on average, in significantly fewer errors than the other two extrapolation conditions. Second, error rates increased steadily as the testing views moved farther away from the learned views, until recognition was near chance levels at large displacements. These results are analogous to the ones shown in Figure 5, in which the aspect ratio is 2/1. Taken together, Figures 5 and 6 demonstrate a horizontal bias as seen in the psychophysical studies. When the aspect ratio is 0.5, which corresponds in our model to training on rotations in the vertical plane, INTER performance is worse. This result suggests that finding specific rotation invariant features was harder in that case, given its lower resolution. On the other hand, there is no significant change in the performance of the EXTRA and ORTHO orientations, suggesting that the extracted features were equally useful in both situations for EXTRA and ORTHO orientations.
⁷Additional support for the usefulness of the extracted features for rotation invariant recognition is shown in subsequent work (Intrator et al. 1991; War et al. 1991), in which the extracted features are used to occlude parts of the images and another network is trained to recognize the occluded images.
Figure 5: Fraction of misclassification performance for wires trained on the horizontal plane (error rate versus distance [deg] from the trained views, for the INTER, EXTRA, and ORTHO conditions).
Figure 6: Fraction of misclassification performance for wires trained on the vertical plane (error rate versus distance [deg], INTER, EXTRA, and ORTHO conditions). Note the degradation in performance in the INTER orientations.
Figures 7 and 8 show the results of the experimental paradigm testing the effect of additional experience during training in the horizontal plane.⁸

Figure 7: Fraction of misclassification performance for wires trained on the horizontal plane with no asymmetry (error rate versus distance [deg], INTER, EXTRA, and ORTHO conditions).

Figure 8: Fraction of misclassification performance for wires trained with reduced training experience (views) (error rate versus distance [deg]).

Both figures show results on training with an aspect ratio of 1; that is, no resolution asymmetry was used between the horizontal and vertical plane. In the experiments summarized in Figure 7, the same number of training views (experience) as in the previous set of experiments was used. In the experiments summarized in Figure 8, half as many training views were used. A number of interesting observations can be made. Results on the INTER condition for an aspect ratio of 1 behave as can be predicted from the previous set of experiments; specifically, error rates were in between those of aspect ratios 2 and 0.5. EXTRA and ORTHO results, however, were less noticeably affected, indicating that object resolution primarily affected the discovery of rotation invariant features to be used for recognition in the INTER condition, as opposed to reducing overall recognition ability. Results from Figure 8, however, demonstrate a different effect. Reducing the number of training patterns, analogous to reducing the experience of vertical training, does not lead to an asymmetry in specific recognition conditions, but instead to a general decline in overall recognition ability. This suggests that reducing the number of training views (without reducing the overall training angle rotation) does not simply affect the ability to extract rotation-invariant features for a particular recognition task; instead, it degrades the model's overall feature extraction performance.

⁸Testing in both cases used the same number of patterns as in the previous experiments.
5 Discussion
This paper touches on issues of object representation. It is assumed that an object is internally represented by a particular combination of features. The nature of these features and the means for binding together the most important combination of features are still undetermined (Sejnowski 1986). We presented an unsupervised method for extracting features directly from gray-level pixel images, and we showed that a surprisingly small number of features is needed for a complex classification task. A comparison of our results to similar psychophysical experiments gives some indication that these features possess desired invariance properties that allow for overall classification performance comparing favorably with human performance. Extracting features from these gray-level images is a highly nontrivial statistical task. The dimensionality of this problem is 63 × 63 pixels; therefore, the curse of dimensionality implies that the number of training patterns should be immense, and yet from a small training set of 132 wires useful directions (projections) were extracted, corresponding to features that were especially useful for rotation invariant recognition. This suggests that the BCM network may be a practical tool for gray-level image recognition in which an internal low-dimensional feature representation emerges as a result of unsupervised training.
Nathan Intrator and Joshua I. Gold
Acknowledgments

We wish to thank Heinrich Bülthoff, Shimon Edelman, and Leon Cooper for the encouragement and many fruitful conversations that have led to this paper. Dave Sheinberg, Philippe Schyns, and Eric Sklar were invaluable for their help in using the AVSW system for getting the gray-level images. Finally, the excellent computational facilities of the Cognitive Science Department at Brown University allowed us to complete the simulations required for this project. Research was supported by the National Science Foundation, the Army Research Office, and the Office of Naval Research.
References

Bear, M. F., and Cooper, L. N. 1988. Molecular mechanisms for synaptic modification in the visual cortex: Interaction between theory and experiment. In Neuroscience and Connectionist Theory, M. Gluck and D. Rumelhart, eds., pp. 65-94. Lawrence Erlbaum, Hillsdale, NJ.
Bear, M. F., Cooper, L. N., and Ebner, F. F. 1987. A physiological basis for a theory of synapse modification. Science 237, 42-48.
Bellman, R. E. 1961. Adaptive Control Processes. Princeton University Press, Princeton, NJ.
Bienenstock, E. L., Cooper, L. N., and Munro, P. W. 1982. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32-48.
Bülthoff, H. H., and Edelman, S. 1992. Psychophysical support for a 2-D view interpolation theory of object recognition. Proc. Natl. Acad. Sci. U.S.A. 89, 60-64.
Clothiaux, E. E., Cooper, L. N., and Bear, M. F. 1991. Synaptic plasticity in visual cortex: Comparison of theory with experiment. J. Neurophysiol. 66, 1785-1804.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
Edelman, S. 1991. Features of recognition. CSTR 10, Weizmann Institute of Science.
Edelman, S., and Bülthoff, H. H. 1992. Orientation dependence in the recognition of familiar and novel views of 3D objects. Vision Res., in press.
Edelman, S., and Poggio, T. 1992. Bringing the Grandmother back into the picture: A memory-based view of object recognition. J. Pattern Recog. Artif. Intell. 6, 37-62.
Edelman, S., and Weinshall, D. 1991. A self-organizing multiple-view representation of 3D objects. Biol. Cybern. 64, 209-219.
Fisher, R. A. 1936. The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179-188.
Friedman, J. H. 1987. Exploratory projection pursuit. J. Am. Stat. Assoc. 82, 249-266.
Object Recognition Using a BCM Network
Friedman, J. H., and Tukey, J. W. 1974. A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput. C-23, 881-889.
Gold, J. I. 1991. A model of dendritic spine head [Ca++]: Exploring the biological mechanisms underlying a theory for synaptic plasticity. Unpublished honors thesis, Brown University.
Harman, H. H. 1967. Modern Factor Analysis, 2nd ed. University of Chicago Press, Chicago.
Huber, P. J. 1985. Projection pursuit (with discussion). Ann. Statist. 13, 435-475.
Hughes, A. 1977. The topography of vision in mammals of contrasting life style: Comparative optics and retinal organisation. In The Visual System in Vertebrates, Handbook of Sensory Physiology VII/5, F. Crescitelli, ed., pp. 613-756. Springer-Verlag, Berlin.
Intrator, N. 1990. A neural network for feature extraction. In Advances in Neural Information Processing Systems, D. S. Touretzky and R. P. Lippmann, eds., Vol. 2, pp. 719-726. Morgan Kaufmann, San Mateo, CA.
Intrator, N. 1992. Feature extraction using an unsupervised neural network. Neural Comp. 4, 98-107.
Intrator, N., and Cooper, L. N. 1992. Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks 5, 3-17.
Intrator, N., Gold, J. I., Bülthoff, H. H., and Edelman, S. 1991. Three-dimensional object recognition using an unsupervised neural network: Understanding the distinguishing features. In Proceedings of the 8th Israeli Conference on AICV, Y. Feldman and A. Bruckstein, eds., pp. 113-123. Elsevier, Amsterdam.
Jones, M. C., and Sibson, R. 1987. What is projection pursuit? (with discussion). J. R. Statist. Soc. A 150, 1-36.
Kruskal, J. B. 1969. Toward a practical method which helps uncover the structure of the set of multivariate observations by finding the linear transformation which optimizes a new 'index of condensation.' In Statistical Computation, R. C. Milton and J. A. Nelder, eds., pp. 427-440. Academic Press, New York.
Lowe, D. G. 1986. Perceptual Organization and Visual Recognition. Kluwer Academic Publishers, Boston, MA.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally tuned processing units. Neural Comp. 1, 281-289.
Poggio, T., and Edelman, S. 1990. A network that learns to recognize three-dimensional objects. Nature (London) 343, 263-266.
Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. IEEE Proc. 78(9), 1481-1497.
Sebestyen, G. 1962. Decision-Making Processes in Pattern Recognition. Macmillan, New York.
Sejnowski, T. J. 1986. Open questions about computation in cerebral cortex. In Parallel Distributed Processing, J. L. McClelland and D. E. Rumelhart, eds., Vol. 2, pp. 372-389. MIT Press, Cambridge, MA.
Sklar, E., Intrator, N., Gold, J. I., Edelman, S. Y., and Bülthoff, H. H. 1991. A hierarchical model for 3D object recognition based on 2D visual representation. Neurosci. Soc. Abstr.
Thompson, D. W., and Mundy, J. L. 1987. Three-dimensional model matching from an unconstrained viewpoint. In Proceedings of IEEE Conference on Robotics and Automation, pp. 208-220. Raleigh, NC.
Ullman, S. 1989. Aligning pictorial descriptions: An approach to object recognition. Cognition 32, 193-254.

Received 11 July 1991; accepted 14 May 1992.
Communicated by John Moody
Complexity Optimized Data Clustering by Competitive Neural Networks

Joachim Buhmann*
Lawrence Livermore National Laboratory, Computational Physics Division, P. O. Box 808, L-270, Livermore, CA 94550 USA
Hans Kühnel
Physik Department, T35, Technische Universität München, Boltzmannstraße, D-8046 Garching, Germany
Data clustering is a complex optimization problem with applications ranging from vision and speech processing to data transmission and data storage in technical as well as in biological systems. We discuss a clustering strategy that explicitly reflects the tradeoff between simplicity and precision of a data representation. The resulting clustering algorithm jointly optimizes distortion errors and complexity costs. A maximum entropy estimation of the clustering cost function yields an optimal number of clusters, their positions, and their cluster probabilities. Our approach establishes a unifying framework for different clustering methods like K-means clustering, fuzzy clustering, entropy constrained vector quantization, or topological feature maps and competitive neural networks.

1 Introduction
Natural and artificial information processing systems in vision, speech recognition, and telecommunication rely on data compression and data recoding techniques either to process large amounts of data efficiently or to discard noise and to reveal the underlying structure in a data set (Linde et al. 1980; Bezdek 1980; Rose et al. 1990; Chou et al. 1989; Kohonen 1984; Rumelhart and Zipser 1985). In situations where we know the nature of the data source, or at least a set of promising models for it, the data clustering task amounts to an estimation of the underlying probability density, with Bayesian learning (Duda and Hart 1973) being the appropriate method. Bayesian classification theory (Hansen et al. 1991) as implemented in the autoclass system is a successful example of this approach. In the absence of a parametric model for the data, however, the
*Present address: Rheinische Friedrich-Wilhelms-Universität, Institut für Informatik II, Römerstraße 164, D-5300 Bonn 1, Germany.
Neural Computation 5, 75-88 (1993)
@ 1993 Massachusetts Institute of Technology
data clustering problem can be formulated as an optimization problem of a suitable objective function which preserves the original data as completely as possible. Distortion errors that result from the clustering process have to be limited to a minimum. Furthermore, optimizing a clustering cost function is the strategy of choice if we have to observe additional constraints imposed on the clustering solution by the information processing application considered. Such constraints are often not related to the data source and, therefore, would not influence a clustering solution based on a probability density estimation of the data source. Clustering strategies derived from an optimization principle are also known as vector quantization (Linde et al. 1980). The most prominent algorithms are MacQueen's K-means clustering algorithm (MacQueen 1967) and the LBG vector quantization algorithm by Linde et al. (1980). One central issue of this paper is to supplement the usual distortion measure with a complexity term that penalizes overly complex clustering solutions. We discuss an objective function for data clustering that compromises between distortion errors and the complexity of a reduced data representation. The joint optimization of these two cost terms determines an optimal number of cluster centers. A maximum entropy estimation of the cluster assignments allows us to derive batch and online versions of the resulting algorithms for a number of different distortion and complexity measures. The close analogy of complexity optimized clustering with winner-take-all neural networks suggests a neural-like implementation resembling topological feature maps.²

2 The Objective Function: Choices for Distortion and Complexity Costs

Approaches to represent a data set {x_i | x_i ∈ R^d; i = 1, ..., N} by a reduced set of data prototypes {y_α | y_α ∈ R^d; α = 1, ..., K} based on a distortion cost function arise in a large variety of data processing applications. The primary objective in these applications is to determine an appropriate number K of clusters and to find their centers y_α and their cluster probabilities p_α. The naming convention for the parameters y_α is not unique; therefore, we will interchangeably use the terms "cluster center," "prototype," or "reference vector" for y_α. The assignments {M_iα | α = 1, ..., K; i = 1, ..., N}, M_iα ∈ {0, 1}, of data point x_i to cluster α are chosen such that the residual distortion error D_iα(x_i, y_α) between data point x_i and cluster center y_α is minimized and that the resulting complexity of the cluster set is limited. M_iα = 1 denotes that data point x_i is uniquely assigned to cluster α, which implies the uniqueness constraint Σ_{α=1}^K M_iα = 1. The cluster probability of a cluster α is defined as p_α = Σ_{i=1}^N M_iα / N. The most frequently used distortion measures are powers of the Euclidean distance between x_i and y_α, for example, D_iα = ||x_i − y_α||^r. However, a Minkowski l_p metric

    D_iα = Σ_{k=1}^d |x_i^(k) − y_α^(k)|^p

might be preferable for applications with a "boxy" data structure, that is, data sets generated by distributions with sharp edges. Rate distortion theory (see Cover 1991) specifies the optimal choice of y_α, that is, it has to be the centroid of cluster α, y_α being defined by the centroid condition Σ_i M_iα (∂/∂y_α) D_iα(x_i, y_α) = 0. The complexity C_α of cluster α depends on the specific information processing application at hand; in particular, we assume that C_α is a function of only the cluster probability p_α. The proposed clustering cost function

    E_K({M_iα}) = Σ_{i=1}^N Σ_{α=1}^K M_iα [D_iα(x_i, y_α) + λ C_α]    (2.1)

compromises between distortion costs and complexity costs, thereby determining an optimal number K of clusters. Variation of {M_iα} implicitly determines the cluster centroid y_α and the cluster probability p_α = Σ_i M_iα / N. λ is a weighting parameter required to adjust the complexity cost to the scale of the distortion costs. The cost function 2.1 has to be optimized in an iterative fashion: (1) vary the assignment variables M_iα for a fixed number K of clusters such that the costs E_K({M_iα}) decrease; (2) increment the number of clusters K → K + 1 and optimize M_iα again. The optimized cluster parameters y_α, p_α, and K are determined by the configuration with minimal costs. Note that configurations with degenerate clusters, that is, y_α = y_β for α ≠ β, have to be rejected since they cause an overestimation of the "true" complexity of a cluster set. The well-known K-means clustering algorithm (Linde et al. 1980) corresponds to the case λ = 0 and K fixed a priori. An application-dependent complexity could be the mass storage space of a storage medium, the processing hardware in electronics, the number of neurons in brains, or the channel bandwidth in communication systems. Complexity costs that penalize small, sparsely populated clusters, that is, C_α = 1/p_α^s, s = 1, 2, ..., favor equal cluster probabilities, thereby emphasizing the hardware aspect of a clustering solution. The special case s = 1 with complexity costs strictly proportional to the number of clusters (C_α = 1/p_α ⇒ Σ_i Σ_α M_iα/p_α = NK) defines a clustering principle which is functionally equivalent to K-means clustering, although we follow the heuristics of increasing the number of clusters incrementally. Biological systems certainly favor such a load-balancing strategy for data processing since they have to employ neurons for information processing, which creates energetic and physiological costs for living brains.

²The idea to implement clustering solutions by winner-take-all networks has been discussed by various authors [e.g., see Kohonen (1984) and Rumelhart and Zipser (1985) for a historical perspective].
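As a concrete illustration, the cost function 2.1 can be evaluated directly from a hard assignment matrix. The sketch below is our own (the function name and implementation details are assumptions; only the two complexity choices C_α = −log p_α and C_α = 1/p_α and the structure of 2.1 are taken from the text):

```python
import numpy as np

def clustering_cost(X, Y, M, lam, complexity="log"):
    """Evaluate E_K = sum_i sum_a M_ia * (D_ia + lam * C_a)  (cf. eq. 2.1),
    with squared Euclidean distortion D_ia = ||x_i - y_a||^2 and a
    complexity cost C_a that depends only on the cluster probability p_a."""
    N = X.shape[0]
    p = M.sum(axis=0) / N                                    # p_a = sum_i M_ia / N
    D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # D_ia
    if complexity == "log":
        C = -np.log(np.clip(p, 1e-12, None))                 # C_a = -log p_a
    else:
        C = 1.0 / np.clip(p, 1e-12, None)                    # C_a = 1/p_a
    return float((M * (D + lam * C)).sum())
```

With λ = 0 and one-hot assignments this reduces to the plain K-means distortion; with C_α = 1/p_α the complexity term contributes λNK, matching the count given in the text.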
A complexity measure that provides a lower bound on encoding costs in the context of data compression and bandwidth limited data transmission is the Shannon entropy of a cluster set, ⟨C⟩ = −Σ_α p_α log p_α (C_α = −log p_α), which was used by Chou et al. (1989) for the design of an entropy constrained vector quantizer. ⟨C⟩ measures the minimum average number of bits necessary to uniquely encode a given cluster set. The choice of an appropriate distortion measure D_iα(x_i, y_α) is another degree of freedom in the design of a clustering algorithm. The most commonly used distortion costs are distance measures D_iα = ||x_i − y_α||^r, which preserve the permutation symmetry of 2.1 with respect to the cluster index α. We will now discuss the relationship of our approach to a class of clustering algorithms with a preference for topological arrangements of clusters. Luttrell (1989) discovered in his studies of hierarchical vector quantizers that a fictitious noise process in the communication channel between sender and receiver can break the permutation symmetry and favor a particular topology over other possible cluster arrangements. This picture, sketched in the block diagram below, establishes a connection to a class of topological vector quantization algorithms known as self-organizing feature maps (Kohonen 1984; Ritter et al. 1992). The same idea of coding schemes that are robust against channel noise has been discussed in the vector quantization literature as source-channel coding (Farvardin 1990).
[Block diagram: {x_i} → encoder → index α → channel noise T_αγ → index γ → decoder → {y_γ}]
At the encoding stage data point x_i is assigned to cluster α. The index α, however, gets corrupted by channel noise and arrives as index γ at the receiver's side of the communication channel. Let us denote the transition probability from index α to index γ by T_αγ, with Σ_{γ=1}^K T_αγ = 1. The receiver, consequently, reconstructs the "incorrect" cluster center y_γ as a representation of x_i instead of the "correct" reference vector y_α. The additional distortions due to the channel noise have to be taken into account when we estimate the most likely assignment variables and, thereby, the centroid positions and the cluster probabilities. The average clustering costs of the data set {x_i} are

    E_K({M_iα}) = Σ_{i=1}^N Σ_{α=1}^K M_iα [⟨D_iα⟩ + λ C_α]    (2.2)
The average distortion error ⟨D_iα⟩ = Σ_{γ=1}^K T_αγ D_iγ(x_i, y_γ) between data point x_i and cluster center y_α quantifies the intrinsic quantization errors and the noise-induced errors. The centroids y_α are defined by

    Σ_{i=1}^N Σ_{γ=1}^K M_iγ T_γα (∂/∂y_α) D_iα(x_i, y_α) = 0
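A short sketch may make the noise-averaged distortion ⟨D_iα⟩ = Σ_γ T_αγ D_iγ concrete. Function names are ours; the helper builds the nearest-neighbor chain matrix of the kind used for Figure 2, with edge rows renormalized to sum to one (that edge handling is our assumption, not spelled out in the text):

```python
import numpy as np

def chain_transition_matrix(K, eta):
    """T_aa = 1 - eta, T_{a,a+-1} = eta/2, T_ag = 0 for |a - g| > 1;
    rows are renormalized so each row sums to one (edge handling is
    an assumption of this sketch)."""
    T = np.diag(np.full(K, 1.0 - eta))
    off = np.full(K - 1, eta / 2.0)
    T += np.diag(off, 1) + np.diag(off, -1)
    return T / T.sum(axis=1, keepdims=True)

def averaged_distortion(X, Y, T):
    """<D>_ia = sum_g T_ag * ||x_i - y_g||^2: the distortion seen by the
    receiver when the transmitted index a is corrupted to g with prob T_ag."""
    D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)  # D_ig
    return D @ T.T                                          # <D>_ia
```

For η = 0 the matrix is the identity and the averaged distortion reduces to the nontopological case T_αγ = δ_αγ.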
The noise process with its transition probabilities T_αγ induces a topology onto the set of clusters; for example, a tridiagonal matrix T_αα = 1 − η; T_α,α±1 = η/2; T_αγ = 0 ∀ |α − γ| > 1 (see Fig. 2) defines a linear chain with nearest-neighbor transitions. Note that the structure of the clustering cost functions 2.1 and 2.2 is the same. We have introduced only a generalized distortion measure ⟨D_iα⟩ in the topological case. Topology preserving clustering reduces to nontopological clustering if we identify T_αγ = δ_αγ, with δ_αγ being the Kronecker symbol (δ_αγ = 1 if α = γ; δ_αγ = 0 if α ≠ γ). The clustering schemes based on 2.1 or 2.2 do not take any a priori knowledge about data classes into account. Many applications, however, provide information on how to partition the training data into a set of classes γ. Given such a priori class information for the training data, it is desirable that clusters do not overlap with class boundaries; for example, data belonging to different classes should not be assigned to the same cluster. This knowledge about the class membership Γ_iγ ∈ [0,1], Σ_γ Γ_iγ = 1 ∀i, of data point i in class γ allows us to supervise a clustering process. The costs associated with supervised clustering depend on the joint probability p_γα = Σ_i Γ_iγ M_iα / N of a data point out of class γ being assigned to cluster α. The uniqueness of class membership implies the relation p_α = Σ_γ p_γα. We introduce a new cost term S_α, called supervision costs, that penalizes assignments of data from different classes to the same cluster. S_α should be positive and it should vanish if the conditional probability p_γ|α, that a data point assigned to cluster α is a member of class γ, approaches certainty, that is, p_γ|α = p_γα / p_α = 1. A supervision cost term that satisfies this condition is defined by the "conditional class entropy" S_α = −Σ_γ p_γ|α log(p_γ|α).
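The conditional class entropy can be computed directly from the joint probabilities p_γα; a minimal sketch (the function name and the numerical clipping are our assumptions):

```python
import numpy as np

def supervision_cost(p_joint):
    """S_a = -sum_g p(g|a) log p(g|a) with p(g|a) = p_ga / p_a  (conditional
    class entropy); p_joint[g, a] holds the joint probabilities p_ga.
    S_a vanishes iff every point in cluster a belongs to a single class."""
    p_a = p_joint.sum(axis=0)                         # p_a = sum_g p_ga
    cond = p_joint / np.clip(p_a, 1e-12, None)        # p(g|a)
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(cond > 0, cond * np.log(cond), 0.0)
    return -term.sum(axis=0)
```

A cluster populated by a single class contributes zero; a cluster split evenly between two classes contributes log 2.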
Adding the supervision costs S_α to the logarithmic complexity costs C_α = −log(p_α), both cost terms being weighted equally (λ_S = λ_C = λ), we arrive at the "supervised complexity costs" C′_α = −Σ_γ p_γ|α log(p_γα), which penalize overly complex clustering solutions and include a priori class information. As discussed in Bichsel and Seitz (1989), the "conditional class entropy" as a penalty term favors a partitioning of the input space with clusters of unique class adherence, a very valuable property for any subsequent classification stage. The simplest functional form of S_α that is polynomial in p_γ|α with a root at p_γ|α = 1 is S_α = Σ_γ p_γ|α (1 − p_γ|α).

3 Maximum Entropy Estimation of y_α and p_α
Different combinations of complexity terms, topology constraints, and class information for supervised clustering define a variety of clustering algorithms that are relevant in very different information processing contexts. To derive robust, preferably parallel algorithms for these data clustering cases, we study the optimization problem 2.2 in a probabilistic framework (see Rose et al. 1990). This design philosophy for
optimization algorithms is motivated by the success of simulated annealing (Kirkpatrick et al. 1983) and of neural optimization algorithms (Durbin and Willshaw 1987; Yuille 1990; Simic 1990). The most likely distribution of assignment variables M_iα can be determined by the maximum entropy principle (Jaynes 1957), and equals the Gibbs distribution P({M}) = exp(−βE_K)/Z. The "computational temperature" T = 1/β, which plays the role of a Lagrange parameter for the average clustering costs, controls the randomness in the assignment process. The partition function Z normalizes the Gibbs distribution. Statistical physics (see, e.g., Amit 1989; Rose et al. 1990) states that maximizing the entropy at a fixed temperature is equivalent to minimizing the free energy

    F_K = −(1/β) ln Z    (3.1)

with respect to the variables p_α, y_α, μ_α, ŷ_α (see Buhmann and Kühnel 1992c for details). The auxiliary variables μ_α, ŷ_α are Lagrange parameters that enforce the constraints p_α = Σ_{i=1}^N M_iα/N and Σ_i Σ_γ M_iγ T_γα (∂/∂y_α) D_iα = 0, respectively. It can be shown that all minima of F_K have ŷ_α = 0 and μ_α = λ p_α (∂C_α/∂p_α). The fact that ŷ_α vanishes³ is implied by the definition of y_α as the centroid of cluster α. The resulting reestimation equations for the expected cluster probabilities and the expected centroid positions are

    p_α = (1/N) Σ_{i=1}^N ⟨M_iα⟩    (3.2)

    Σ_{i=1}^N Σ_{γ=1}^K ⟨M_iγ⟩ T_γα (∂/∂y_α) D_iα(x_i, y_α) = 0    (3.3)

Equations 3.2 and 3.3 are necessary conditions for F_K being minimal. We have identified p_α, y_α in 3.1-3.4 with their expectation values, which is

³In self-organizing feature maps y_α is usually defined as the center of mass of the Voronoi cell α instead of the generalized centroid Σ_i Σ_γ M_iγ T_γα (∂/∂y_α) D_iα(x_i, y_α) = 0, which produces a nonvanishing conjugate field ŷ_α.
justified in the large N limit. The expectation value ⟨M_iα⟩ of the assignment variable M_iα can be interpreted as a fuzzy membership of data point x_i in cluster α. The fuzziness, which is controlled by the temperature T, sets a resolution limit below which different clusters cannot be discriminated. In the case of supervised clustering, with the cost function 2.2 supplemented by supervision costs, the free energy generalizes to the form given in equation 3.5. The corresponding Lagrange term μ_γα enforces the constraint p_γα = Σ_{i=1}^N Γ_iγ M_iα / N. Equation 3.2 has to be generalized to a reestimation of the average joint probabilities p_γα (equation 3.6).
The reestimation equation for the cluster centroids has the same form as in 3.3, with ⟨M_iα⟩ given by 3.7. See Buhmann and Kühnel (1992a) for a more detailed derivation of equations 3.5-3.7. The global minimum of the free energy 3.1 with respect to p_α, y_α (ŷ_α, μ_α already inserted) determines the maximum entropy solution of the cost function 2.1. Note that the optimization problem 2.1 or 2.2, with its state space of size K^N, has been reduced to a K(d+1)-dimensional minimization of the free energy F_K (3.1 or 3.5). To find the optimal parameters p_α, y_α and the number of clusters K that minimize the free energy, we start with one cluster located at the centroid of the data distribution, split that cluster, and reestimate p_α, y_α using equations 3.2 and 3.3. The new configuration is accepted as an improved solution if the free energy 3.1 has been decreased. This splitting and reestimation loop is continued until we fail to find a new configuration with lower free energy. An iterative splitting strategy is required because the cost function exhibits numerous local minima, especially since we compare solutions with different numbers of clusters. Cluster splitting in the topological case should be performed such that a new cluster does not violate the topology constraint severely,
Figure 1: A data distribution (4000 data points) (a), generated by four normally distributed sources, is clustered with the complexity measure C_α = −log p_α. Two zero temperature solutions for λ = 2.5, 0.4 are shown in (b, c), where plus signs (+) denote the positions of the gaussians and stars (*) denote cluster centers.
that is, it should not give rise to a topological defect. The resulting configuration is stable under single cluster splitting, but we are not assured of finding the global minimum. Note the difference between our approach to determining the number of clusters and the suggestion by Rose et al. (1990) to use the temperature as a control parameter. The temperature determines the fuzziness of a clustering solution, whereas the complexity term penalizes excessively many clusters. In the case of hard clustering (T → 0) we still want to limit the number of clusters by an application-dependent complexity measure. Nontopological (T_αγ = δ_αγ) clustering results at zero temperature for the logarithmic complexity measure (C_α = −log p_α) are shown in Figure 1. At high complexity costs (b) the algorithm finds four clusters located at the centers of the gaussians. In the limit of very small complexity costs (c) the best clustering solution densely covers the data distribution. The specific choice of logarithmic complexity costs causes an almost homogeneous density of cluster centers, a phenomenon that is explained by the vanishing average complexity costs ⟨C_α⟩ = −p_α log p_α of very sparsely occupied clusters (see Gish and Pierce 1968; Buhmann and Kühnel 1992c). Analytical results on the asymptotic (large K) cluster density for different complexity measures are discussed in Buhmann and Kühnel (1992c). Figure 2 shows a clustering configuration assuming a one-dimensional topology in index space with nearest-neighbor transitions. The short links between neighboring nodes of the neural chain indicate that the distortions due to external noise have also been optimized. Note that complexity optimized clustering determines the length of the chain or, for a more general noise distribution, an optimal size of the cluster
Figure 2: Topology preserving clustering with C_α = 1/p_α: a chain of 50 clusters covers the data distribution of Figure 1. The short average distances between consecutive clusters demonstrate that the total distortion error has been minimized, including the distortions induced by external noise (η = 0.05).
set. This stopping criterion for adding new cluster nodes generalizes self-organizing feature maps (Kohonen 1984) and removes arbitrariness in the design of topological mappings. Furthermore, our algorithm is derived from an energy minimization principle, in contrast to self-organizing feature maps, which "cannot be derived as a stochastic gradient on any energy function" (Erwin et al. 1992). The complexity optimized clustering scheme has been tested on the real-world task of image compression (Buhmann and Kühnel 1992b). Entropy optimized vector quantization of wavelet decomposed images has reduced the reconstruction error of the compressed images by up to 30%. Furthermore, the tendency of entropy optimized clustering to represent outlier regions with low data density yields psychophysically more pleasing image reconstructions, since rare image features like edges are more faithfully encoded than by K-means clustering (Buhmann and Kühnel 1992c). Figure 3 summarizes our experiments with supervised clustering controlled by the supervision costs C′_α. To test the performance of supervised clustering compared to an unsupervised scheme such as K-means clustering,
Figure 3: Comparison of K-means clustering results with supervised clustering of a data set generated by a homogeneous two-dimensional data distribution with a disk shape (||x_i|| ≤ 1). The data set (800 data points) is divided into two classes of eight equally sized segments per class, which alternate. Supervised clustering outperforms K-means clustering significantly for K ≥ 16, as indicated by the decrease of the average "conditional class entropy" ⟨S_α⟩.
we generated a disk-shaped data set with homogeneous density inside the disk ||x_i|| ≤ 1 and zero density outside. The data set was divided into two classes of eight segments per class, with segments of different classes adjacent to each other. This artificial data set of two strongly intermingled classes allows us to benchmark the supervised clustering scheme against K-means clustering. Figure 3 shows that supervised clustering outperforms K-means clustering significantly for K ≥ 16 clusters, as indicated by the decrease of the average "conditional class entropy" ⟨S_α⟩.
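One fixed-point sweep of the maximum entropy reestimation can be sketched as follows. The Gibbs/softmax form of ⟨M_iα⟩ and the effective cost term λ ∂(p_α C_α)/∂p_α for the logarithmic complexity C_α = −log p_α (which gives −log p_α − 1) are our reading of Section 3; the function name and numerical details are assumptions of this sketch:

```python
import numpy as np

def reestimate(X, Y, p, beta, lam):
    """One sweep of eqs. 3.2/3.3 (nontopological case, squared distortion):
    fuzzy memberships <M_ia> from a Gibbs distribution at temperature 1/beta,
    then reestimated cluster probabilities p_a and centroids y_a."""
    D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)    # D_ia
    eff = D + lam * (-np.log(np.clip(p, 1e-12, None)) - 1.0)  # D_ia + lam d(pC)/dp
    logits = -beta * eff
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    M = np.exp(logits)
    M /= M.sum(axis=1, keepdims=True)                         # <M_ia>, rows sum to 1
    p_new = M.mean(axis=0)                                    # eq. 3.2
    w = np.clip(M.sum(axis=0), 1e-12, None)
    Y_new = (M.T @ X) / w[:, None]                            # eq. 3.3 (centroids)
    return M, p_new, Y_new
```

Iterating this sweep inside the cluster-splitting loop described above drives the free energy downward; for β → ∞ the memberships become hard assignments.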
4 Online Clustering
The described clustering procedures are batch algorithms since they require all data points {x_i} to be available at the same time, a very restrictive constraint for applications dealing with a large data volume. Online clustering algorithms process a data stream sequentially by updating y_α and p_α iteratively. Such an algorithm is implemented by the first two layers of a competitive neural network with the architecture depicted in Figure 4. The third layer, composed of classification units, assigns data points to classes provided that class labels are available for the training data. The third layer is absent in unsupervised clustering schemes. The synaptic vectors y_α define localized receptive fields of the clustering units
α, which can be interpreted as radial basis function units. Learning after N − 1 data points have been processed results in the incremental changes

    p_α^(N) = p_α^(N−1) + (1/N) [⟨M_N,α⟩ − p_α^(N−1)]    (4.1)

    p_γα^(N) = p_γα^(N−1) + (1/N) [Γ_N,γ ⟨M_N,α⟩ − p_γα^(N−1)]    (4.2)

The upper indices (N) and (N − 1) denote the estimates of p_α, y_α after x_N and x_{N−1} have been processed, respectively. To derive equations 4.1-4.3 we have expanded the equations 3.2, 3.6, and 3.3 up to linear terms in Δp_α, Δp_γα, Δy_α, keeping ⟨M_N,α⟩ fixed. All terms proportional to derivatives of ⟨M_N,α⟩ vanish exponentially fast in the hard clustering limit (O[exp(−c/T)], c > 0, as T → 0).
Figure 4: A three layer competitive neural network for data clustering, with d units in the input layer, K units in the clustering layer, and G units in the classification layer, implements the complexity optimized clustering algorithm. Unit α in the clustering layer receives activity from the input units weighted by the synaptic vector y_α. The units in the clustering layer are connected by a winner-take-all network. Due to mutual competition, each data point x_N is assigned to exactly one cluster α in the case of hard clustering and to a small number of clusters for fuzzy clustering. The third layer calculates the conditional class probability of a data point i being in class γ. This layer is absent for unsupervised clustering.
In the case of squared Euclidean distortion costs (D_iα = ||x_i − y_α||²), equation 4.3 reduces to the form

    y_α^(N) = y_α^(N−1) + [Σ_γ T_αγ ⟨M_N,γ⟩ / (N Σ_γ T_αγ p_γ^(N))] (x_N − y_α^(N−1))    (4.4)
which corresponds to MacQueen's update rule for K-means clustering (MacQueen 1967) and which is similar to learning in competitive neural networks. yiN-') is moved toward the most recent data point XN proportionally to the mismatch (XN - yLN-'))and proportional to XN'S effective membership C, T,,(MN,,) in cluster a. In addition to that, the update formula 4.4 weights any change in yiN)by the effective number of data points NC, T,,piN), which are already assigned to cluster a. The learning rate 1/(NC, T,,piN)) treats different clusters according to their history. That generalizes conventional topological feature maps which suggest the same learning rate for all clusters (Kohonen 1984). Is the learning rate c/(NC,T,,piN)) with c = 1 the fastest rate that guarantees the correct maximum entropy estimation of the cluster centers? This question of optimally efficient stochastic approximation of the clustering solution has received considerable attention recently, especially since there exists evidence from numerical simulation that fast adaptive K-means clustering (Darken and Moody 1990) and "search then converge" clustering schedules (Darken and Moody 1992) outperform the original MacQueen's update rule in speed. According to numerical and theoretical results there exists a critical c* below which convergence in the asymptotic limit N + 00 is dramatically slowed down (Darken and Moody 1992). Unfortunately, c* is unknown for clustering and the numerically supported conjecture that the optimal learning rate c"p' = c* > 1 has to be proved. Online optimization of the number of clusters K relies on a heuristics for cluster merging and cluster creation. We have explored the following heuristics for cluster creation: A data point xi initializes a new cluster K 1 with YK+I = Xi if the clustering costs by assigning xi to an already existing cluster exceeds the complexity costs of the new cluster K + 1, that is, CK+I < min,( llxi - y,ll' + XC,). 
We found in a series of clustering experiments that this strategy causes a slight overestimation of the number of clusters but the resulting cluster configurations have comparable cluster costs to configurations found in batch runs.
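The count-weighted learning rate has a familiar zero-temperature limit: with hard, winner-take-all assignments, the update reduces to MacQueen's online K-means, where each center moves with step size $1/n_\alpha$ and $n_\alpha$ counts the points the cluster has absorbed so far. A minimal sketch of this hard-assignment limit (the function name and demo data are ours, not the paper's):

```python
import numpy as np

def online_kmeans(data, n_clusters):
    """MacQueen-style online K-means in the hard-assignment limit:
    each center moves toward a newly presented point with learning
    rate 1/n_a, where n_a counts the points cluster a has absorbed
    so far -- the per-cluster "history" weighting discussed above."""
    data = np.asarray(data, dtype=float)
    centers = data[:n_clusters].copy()   # seed centers from first points
    counts = np.ones(n_clusters)
    for x in data[n_clusters:]:
        a = int(np.argmin(((centers - x) ** 2).sum(axis=1)))  # winner
        counts[a] += 1
        centers[a] += (x - centers[a]) / counts[a]  # step size 1/n_a
    return centers, counts
```

The paper's soft version replaces the hard winner by the fuzzy assignments $T_{\alpha\nu}\langle M_{N\nu}\rangle$, and the creation heuristic above would additionally open a new center whenever the best assignment cost exceeds the complexity cost of a fresh cluster.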
5 Conclusion
Complexity optimized clustering is a maximum entropy approach to unsupervised and supervised data clustering that determines the optimal number of clusters as a compromise between distortion errors and the complexity of a cluster set. The complexity term turns out to be as important for the design of a cluster set as the distortion measure. We have derived a batch and an online version of the proposed clustering algorithms. Complexity optimized clustering maps onto a winner-take-all network, which suggests hardware implementations in analog VLSI (Andreou et al. 1991; Lazzaro et al. 1989). Topology preserving clustering suggests a cost function based approach to limit the size of self-organizing maps. The proposed framework for data clustering unifies traditional clustering techniques like K-means clustering, entropy-constrained clustering, or fuzzy clustering with neural network approaches such as topological vector quantizers. The network size and the cluster parameters are determined by a problem-adapted complexity function, which removes considerable arbitrariness present in other nonparametric clustering methods. A related approach to clustering (Wong 1992) was recently brought to our attention. The clustering algorithm proposed by Wong is complementary to ours since it relies on cluster melting rather than cluster splitting.
Acknowledgments

It is a pleasure to thank H. Sompolinsky, N. Tishby, and P. Tavan for helpful discussions. JB has been supported by the German Federal Ministry of Science and Technology and by the Air Force Office of Scientific Research (C. von der Malsburg, PI) while working at the Center for Neural Engineering of the University of Southern California. HK is a recipient of a graduate fellowship, Technical University Munich.
References

Amit, D. 1989. Modelling Brain Function. Cambridge University Press, Cambridge.
Andreou, A. G., Boahen, K. A., Pouliquen, P. O., Pavasović, A., Jenkins, R. E., and Strohbehn, K. 1991. Current mode subthreshold MOS circuits for analog VLSI neural systems. IEEE Transact. Neural Networks 2, 205-213.
Bezdek, J. C. 1980. A convergence theorem for the fuzzy isodata clustering algorithms. IEEE Transact. Pattern Anal. Machine Intelligence 2(1), 1-8.
Bichsel, M., and Seitz, P. 1989. Minimum class entropy: A maximum information approach to layered networks. Neural Networks 2, 133-141.
Buhmann, J., and Kühnel, H. 1992a. Unsupervised and supervised data clustering with competitive neural networks. IJCNN International Joint Conference on Neural Networks, Baltimore, pp. IV-796-IV-801. IEEE.
Buhmann, J., and Kühnel, H. 1992b. Complexity optimized vector quantization: A neural network approach. In Data Compression Conference, J. Storer, ed., pp. 12-21. IEEE Computer Society Press, Los Alamitos, CA.
Buhmann, J., and Kühnel, H. 1992c. Vector quantization with complexity costs. IEEE Transact. Inform. Theory, in press.
Chou, P. A., Lookabaugh, T., and Gray, R. M. 1989. Entropy-constrained vector quantization. IEEE Transact. Acoust. Speech Signal Process. 37, 31-42.
Cover, T. M., and Thomas, J. A. 1991. Elements of Information Theory. Wiley, New York.
Darken, C., and Moody, J. 1990. Fast adaptive k-means clustering: Some empirical results. International Joint Conference on Neural Networks, San Diego, pp. II-233-II-238. IEEE.
Darken, C., and Moody, J. 1992. Towards faster stochastic gradient search. In Neural Information Processing Systems 4. Morgan Kaufmann, San Mateo, CA.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. Wiley, New York.
Durbin, R., and Willshaw, D. 1987. An analogue approach to the travelling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Erwin, E., Obermayer, K., and Schulten, K. 1992. Self-organizing maps: Ordering, convergence properties, and energy functions. Biol. Cybern. 67, 47-55.
Farvardin, N. 1990. A study of vector quantization for noisy channels. IEEE Transact. Inform. Theory 36(4), 799-809.
Gish, H., and Pierce, J. N. 1968. Asymptotically efficient quantizing. IEEE Transact. Inform. Theory IT-14, 676-683.
Hanson, R., Stutz, J., and Cheeseman, P. 1991. Bayesian classification theory. Tech. Rep. FIA-90-12-7-01, NASA Ames Research Center.
Jaynes, E. T. 1957. Information theory and statistical mechanics. Phys. Rev. 106, 620-630.
Kirkpatrick, S., Gelatt, C., and Vecchi, M. 1983. Optimization by simulated annealing. Science 220, 671-680.
Kohonen, T. 1984. Self-Organization and Associative Memory. Springer, Berlin.
Lazzaro, J., Ryckebusch, S., Mahowald, M. A., and Mead, C. A. 1989. Winner-take-all networks of O(n) complexity. In Neural Information Processing Systems 1, pp. 703-711. Morgan Kaufmann, San Mateo, CA.
Linde, Y., Buzo, A., and Gray, R. M. 1980. An algorithm for vector quantizer design. IEEE Transact. Commun. COM-28, 84-95.
Luttrell, S. P. 1989. Hierarchical vector quantisation. IEE Proc. 136, 405-413.
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297.
Ritter, H., Martinetz, T., and Schulten, K. 1992. Neural Computation and Self-organizing Maps. Addison-Wesley, New York.
Rose, K., Gurewitz, E., and Fox, G. 1990. Statistical mechanics and phase transitions in clustering. Phys. Rev. Lett. 65(8), 945-948.
Rumelhart, D. E., and Zipser, D. 1985. Feature discovery by competitive learning. Cog. Sci. 9, 75-112.
Simic, P. 1990. Statistical mechanics as the underlying theory of "elastic" and "neural" optimizations. Network 1, 89-103.
Wong, Y. 1992. Clustering data by melting. Neural Comp. 5(1), 89-104.
Yuille, A. L. 1990. Generalized deformable models, statistical physics, and matching problems. Neural Comp. 2(1), 1-24.

Received 4 December 1991; accepted 18 August 1992.
Communicated by John Moody and Steve Zucker
Clustering Data by Melting

Yiu-fai Wong
Department of Electrical Engineering, 116-81, California Institute of Technology, Pasadena, CA 91125 USA

We derive a new clustering algorithm based on information theory and statistical mechanics, which is the only algorithm that incorporates scale. It also introduces a new concept into clustering: cluster independence. The cluster centers correspond to the local minima of a thermodynamic free energy, which are identified as the fixed points of a one-parameter nonlinear map. The algorithm works by melting the system to produce a tree of clusters in the scale space. Melting is also insensitive to variability in cluster densities, cluster sizes, and ellipsoidal shapes and orientations. We tested the algorithm successfully on both simulated data and a Synthetic Aperture Radar image of an agricultural site with 12 attributes for crop identification.

1 Introduction
Clustering is an important problem that can be found in many applications where a priori knowledge about the distribution of the observed data is not available (Duda and Hart 1973; Jain and Dubes 1988). Simply stated, the goal is to partition a given data set into several compact groups. Each group indicates the presence of a distinct category in the measurements. Clustering is widely used for exploratory data analysis in diverse disciplines; the literature is therefore spread among many different fields over many years, and it is almost impossible to cite each contribution individually. One of the early algorithms was invented by Lloyd (1982) and was later extended by Linde et al. (1980) for vector quantization. In pattern recognition, the ISODATA algorithm (Ball and Hall 1967) and its sequential version, the k-means clustering algorithm, have been extensively used. Other algorithms include the fuzzy techniques (Ruspini 1969; Bezdek 1981; Gath and Geva 1989; Rose et al. 1990) and the hierarchical techniques such as the agglomerative and divisive methods (see Wishart 1969). These algorithms, however, suffer from several difficulties: (a) they are highly sensitive to the initialization; (b) they perform poorly if the data contain overlapping clusters; and (c) they suffer from the inability to handle variabilities in cluster shapes, cluster densities, and cluster sizes. The most urgent problem is the lack of cluster validity criteria (Bezdek 1981). All the algorithms tend to create clusters even when no natural clusters exist in the data. In this paper, we examine a fundamental way of looking at the problem of clustering, and derive a new algorithm based on information theory and statistical mechanics. We identify clustering with heating up a thermodynamic system, giving rise to hierarchical clustering in the scale space. Melting can also account for variability in cluster densities, cluster sizes, and cluster shapes (ellipsoids). The algorithm was tested successfully on both simulated data and a Synthetic Aperture Radar (SAR) image of agricultural land with 12 attributes for crop identification (Wong et al. 1992; Wong and Posner 1992). A main contribution of this paper is that this interdisciplinary approach from information theory, thermodynamics, and nonlinear dynamics can provide a proper formulation for effective clustering and related optimization problems.

Neural Computation 5, 89-104 (1993)
© 1993 Massachusetts Institute of Technology
2 Scale and Cluster Independence
Intuition tells us that the number of clusters depends on the scale at which we look at the data. At a very coarse scale, the whole data set is a cluster, whereas at a very fine scale, every datum is itself a cluster. Scale has not been exploited by the other clustering techniques, though the idea of scale space has been around for a long time (Gabor 1946; Koenderink 1984). Wong (1992) introduced a concept called "cluster independence." To explain it, consider the situation where several people are given the same data and the same rule about clusters. Each is told to stop once a cluster is found. If they do not communicate, it is clear that the assignments of the clusters are independent. If clusters indeed exist, the information should be present in the data itself. The notion of scale implies that the data points near the cluster centers should give more information while the data points far away should give less. This can be implemented by assigning a cost of having a data point reveal the cluster locations. To make a cluster robust, the information should be spread among the data. If we treat the contributions to the determination of a cluster from all the data points as a probability distribution, then this probability distribution should be chosen such that its entropy is maximized subject to a linear cost constraint (Jaynes 1957). Cluster independence allows us to consider one cluster at a time. Suppose the cost function is $e(x) = (x - y)^2$, where $x$ is a datum and $y$ is a cluster center. This means that we use the squared distance as a measure of the compactness of a cluster. Let $P(x)$ denote the contribution
of datum $x$ to $y$. Maximizing the entropy

$$ -\sum_x P(x) \log P(x) \qquad (2.1) $$

subject to the constraint

$$ \sum_x P(x)\, e(x) = C \qquad (2.2) $$

one obtains

$$ P(x) = e^{-\beta (x-y)^2} / Z \qquad (2.3) $$

where $Z = \sum_x e^{-\beta (x-y)^2}$. To make the connection with thermodynamics, we define the "free energy"

$$ F = -\frac{1}{\beta} \log Z \qquad (2.4) $$
At equilibrium, it is known that a thermodynamic system settles into configurations that minimize its free energy. That is, we want $\partial F / \partial y = 0$, or equivalently,

$$ y = \frac{\sum_x x\, e^{-\beta (x-y)^2}}{\sum_x e^{-\beta (x-y)^2}} \qquad (2.5) $$

the weighted mean of the data. We point out that equation 2.5 is very different from that obtained by the maximum-likelihood estimate of a Gaussian mixture (Wolfe 1970; Cheeseman et al. 1988). Unlike these Bayesian approaches, our method does not assume any particular data distribution. Without loss of generality, we restrict the notation and the exposition to the case of one-dimensional data. The case of higher dimensional data was treated in Wong (1992); the dynamics are essentially the same.

Definition 1. A nominal cluster is centered at y if and only if y is a local minimum of the free energy of the thermodynamic system described above.

Equation 2.5 is only a necessary condition for $y$ to be a cluster center. The sufficient conditions will become clearer as the "melting" process is explained; the details can be found in Wong (1992). Because of that, we will use "cluster" instead of "nominal cluster." Without worrying whether nominal clusters are real clusters, one can ask the following questions:

1. Do clusters exist? This depends on whether the equation has any solutions.
2. How many clusters are there? This depends on the number of solutions the equation has.
3. How do the clusters evolve? The answer is given by the trajectories of the solutions of equation 2.5 as $\beta$ varies.
The above list could have been longer, but it suffices to illustrate the importance of equation 2.5. Since we are concerned only with local minima, this is a great advantage over other applications of physical optimization where global optima are sought.

3 Melting and Its Dynamics
Solutions of equation 2.5 cannot be computed analytically. However, they are identical to the fixed points of the following one-parameter map¹:

$$ y^{(n+1)} = f\bigl(y^{(n)}\bigr) = \frac{\sum_x x\, e^{-\beta (x - y^{(n)})^2}}{\sum_x e^{-\beta (x - y^{(n)})^2}} \qquad (3.1) $$
This connects our problem with nonlinear dynamics. Figure 1 is a plot of the map for 20 data points along a unit interval with a large $\beta$. It is clear that the free energy $F$ acts as the Lyapunov function (Wiggins 1990) for the mapping. The difference between successive $y$s is $-\frac{1}{2}\partial F / \partial y$. Thus, the $y$s march down the surface of the free energy and settle down in some local minimum. The mapping 3.1 exhibits no chaotic behavior. Hence, solutions can always be computed iteratively and the convergence is exponentially fast. One can see that $\beta$ truly captures the notion of scale. At a very large $\beta$, every datum is itself a cluster, while at a very small $\beta$, the whole data set is a cluster. The essence of the algorithm is thus as follows. Start with a huge $\beta$ (fine scale); initialize every datum as a cluster. As $\beta$ is gradually decreased, the number of clusters decreases due to the merging of the clusters. When two clusters merge, the associated data points are merged as well. Eventually the whole data set is a cluster. Specifically, the Melting Procedure is as follows:

1. Choose $\beta_{\max}$; $\beta_{\max}$ is a number related to the dynamic range and an assumed noise in the observations;
2. set $i = 1$, $\beta_1 = \beta_{\max}$;
3. let every data point be a cluster;
4. iterate according to the mapping 3.1 $N$ times or until the clusters converge. In our simulations, $N = 200$;
5. record the new cluster centers;

¹We could have instead established a similar equivalence with the differential equation $dy/dt = \sum_x (x - y)\, e^{-\beta (x-y)^2}$, but the analysis is very similar and the results are the same.
Figure 1: The map for 20 data points along the unit interval.

6. if two or more clusters that previously were distinct share the same center, the set of data associated with the new cluster is the union of those associated with the original clusters;
7. $i = i + 1$, $\beta_i = \beta_{i-1}/1.05$;
8. if there is more than one cluster, go to 4; else Melting is complete.
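The eight steps above can be sketched compactly for one-dimensional data. The shrink factor 1.05 and $N = 200$ are the paper's values; the starting scale and the merge tolerance here are illustrative choices:

```python
import numpy as np

def melt(x, beta_max=1e4, shrink=1.05, n_iter=200, tol=1e-6):
    """Sketch of the Melting Procedure: every datum starts as a
    cluster; at each scale the centers are iterated to fixed points
    of the map 3.1, coincident centers are merged, and beta is
    lowered by a constant factor until one cluster remains.
    Returns the list of (beta, centers) recorded at each scale."""
    x = np.asarray(x, dtype=float)
    centers = np.unique(x)              # step 3: one cluster per datum
    beta, history = beta_max, []
    while len(centers) > 1:             # step 8
        for _ in range(n_iter):         # step 4: iterate mapping 3.1
            w = np.exp(-beta * (x[None, :] - centers[:, None]) ** 2)
            centers = (w * x[None, :]).sum(1) / w.sum(1)
        # step 6: merge clusters whose centers coincide
        centers = np.unique(np.round(centers / tol) * tol)
        history.append((beta, centers.copy()))
        beta /= shrink                  # step 7
    return history
```

Run on two well-separated groups, the recorded history passes through a long two-cluster stage before collapsing to a single center near the global mean, i.e., the tree of clusters described in the text.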
It is clear that the Melting Procedure generates a strict tree structure in the scale space, analogous to a dendrogram. Figure 2 shows an example of the Melting Procedure for a set of one-dimensional data that has two clusters. The graphs are obtained by computing the fixed points of equation 3.1 as scale increases. The horizontal axis indexes $i$ in the Melting Procedure. We merely identify scale with $i$, which is plotted logarithmically because of the exponential terms in equation 2.5. The original data are plotted as *s at $i = 0$. The dynamics involved in the merging process can be studied using local bifurcation theory (Wiggins 1990). The necessary condition for bifurcation to occur is $\partial f/\partial y = 1$. That is,

$$ \sum_x (x - y)^2\, P(x) \;\ge\; \frac{1}{2\beta} \qquad (3.2) $$
Figure 2: The fixed points versus scale. The leftmost points are the data.
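The stability condition $\partial f/\partial y = 1$ can be checked numerically: differentiating the map 3.1 gives $\partial f/\partial y = 2\beta\,\mathrm{Var}_P(x)$, the weighted variance under $P(x) \propto e^{-\beta(x-y)^2}$, so a fixed point destabilizes once this slope reaches 1. A sketch with illustrative data (function names are ours):

```python
import numpy as np

def map_slope(x, y, beta):
    """Slope df/dy of the map 3.1 at y.  Differentiating the weighted
    mean gives df/dy = 2*beta*Var_P(x) with P(x) ~ exp(-beta (x-y)^2);
    a fixed point loses stability when this value reaches 1."""
    w = np.exp(-beta * (x - y) ** 2)
    p = w / w.sum()
    m = (p * x).sum()
    return 2 * beta * (p * (x - m) ** 2).sum()

def f(x, y, beta):
    """One application of the map 3.1 (the weighted mean)."""
    w = np.exp(-beta * (x - y) ** 2)
    return (w * x).sum() / w.sum()
```

For two tight groups on the unit interval, the center of one group is a stable fixed point at fine scale (slope below 1), while the symmetric midpoint of the whole data set is unstable at that scale, exactly the picture behind Figure 2.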
In Wong (1992), two types of bifurcations were identified: pitchfork and saddle-node. Their bifurcation diagrams are shown in Figure 3, which shows the trajectories of the fixed points as the parameter $\beta$ is varied around its critical value. In a pitchfork bifurcation, two clusters continuously merge into a cluster, while in a saddle-node bifurcation, a cluster becomes unstable and is siphoned into another cluster. Such bifurcations can be seen in Figure 2. The interpretation of these two bifurcations for cluster analysis is as follows (Wong 1992): a pitchfork bifurcation indicates (1) uniformly spaced or nonclustered data, or (2) clustered data but with a high degree of symmetry at certain scales; a saddle-node bifurcation indicates an inhomogeneous spatial distribution present in the data. As one expects, in clustering data, saddle-node bifurcations will be most frequently observed. It is now clear why we choose to "melt" the system starting from a low temperature, as contrasted with annealing (Kirkpatrick et al. 1983). Annealing would fail since a saddle-node bifurcation implies that we do not know how much hill-climbing is needed to reach the other local minimum. We can also find an information-theoretic basis for condition 3.2. The rate distortion function deals with the question of the minimum number of bits needed to encode a source symbol subject to an expected distortion
Figure 3: (a) Pitchfork bifurcation in our clustering scheme. (b) Saddle-node bifurcation in our clustering scheme.
constraint (Pierce and Posner 1980). For a Gaussian source with variance $\sigma^2$ and an average distortion $\le \delta$,

$$ R(\delta) = \max\left( \frac{1}{2} \log \frac{\sigma^2}{\delta},\; 0 \right) \qquad (3.3) $$

Thus, when equation 3.2 becomes an equality, $R(\delta) = 0$ signifies that there is no need to waste bits to encode the source. The cluster should either disappear or be merged.

4 What Is a Good Cluster and How Many Are There?
One needs a criterion to decide the good clusters among all the clusters in the scale space. We will briefly outline the ideas in Wong (1992). Recall that $P(x)$ is the contribution of a data point to a cluster. Thus the quantity, the fractional free energy (FFE) of a nominal cluster $Q$,

$$ \mathrm{FFE}(Q) = \sum_{x \in Q} P(x) \qquad (4.1) $$
Figure 4: (a) Data and the computed clusters illustrating ability to handle many clusters.
is a measure of how good a cluster $Q$ is. A large FFE indicates that most of the contributions come from the data belonging to the cluster itself, and vice versa. What is large or small is set by a threshold $M_T$, which expresses a degree of confidence. Hence, by keeping track of the fixed points and their FFE values, a criterion for deciding "good clusters" was defined in Wong (1992). We need to select the real clusters among the good clusters. It is very difficult to define a universally accepted criterion because clusters really need to be interpreted in the context of the specific applications. Nonetheless, an attempt to define a scale-based criterion was carried out in Wong (1992), which has been found to be applicable in the radar application (Wong and Posner 1992). If distinct good clusters exist in the data, their FFEs should remain good over a large range of logarithmic scale in $\beta$ even though the fixed points may vary their positions slightly. In addition, the FFEs of these clusters should start out with very high values, only to drop quickly when they are about to bifurcate; hence, for a good cluster, the longer its FFE remains high, the more robust it is.
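Reading the FFE as the share of the total Boltzmann weight contributed by a cluster's own member points (our reading of equation 4.1, consistent with the description above), the quantity is straightforward to compute; the sketch and its demo data are illustrative:

```python
import numpy as np

def ffe(x, members, y, beta):
    """Fractional free energy of the nominal cluster centered at y:
    the fraction of the total weight sum_x exp(-beta (x-y)^2)
    contributed by the cluster's own member points.  Near 1 for a
    well-separated cluster; it drops as the scale coarsens (beta
    decreases) and foreign points start to contribute."""
    w = np.exp(-beta * (x - y) ** 2)
    return w[members].sum() / w.sum()
```

At a fine scale the FFE of a well-separated group is essentially 1; coarsening the scale lets distant points contribute and the FFE falls, the behavior used below to define robustness.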
Figure 4: (b) x-components of the trajectories of the cluster centers versus scale.
Figure 4: (c) Plots of fractional free energies for the data points in a.
Figure 4b shows the x-component of the trajectories of the clusters for the data shown in Figure 4a.² Figure 4c is a plot of the FFEs of the clusters versus scale. One sees that there is a range of scale over which three clusters exist. But there is a longer range of scale over which four clusters exist, which is the correct answer. Here is how we formally define the robustness of a good cluster:

Definition 2. The robustness of a good cluster is defined as the range of logarithmic scales over which its FFE remains above $M_T$.
The rule to decide the number of clusters is to pick out the most robust ones until there are no more good clusters left. Here is the Melting Algorithm (Wong 1992):

1. Perform the Melting Procedure;
2. decide the good clusters among the nominal clusters; denote the set by $T = \{T_1, T_2, \ldots\}$;
3. compute the robustness of the good clusters;
4. initialize $U$ to an empty set;
5. while $T$ is nonempty, do the following:
   a. pick the element $T_k$ in $T$ with the biggest robustness measure;
   b. put this element into $U$;
   c. remove $T_k$ and the elements in $T$ that either are contained in or contain $T_k$;
6. collect the data points that do not belong to one of the clusters in $U$ into a set $N$, which we hope is empty.
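Step 5's greedy selection can be sketched with each good cluster reduced to a (robustness, member set) pair, a flat stand-in for the melting tree that we adopt purely for illustration:

```python
def select_clusters(good_clusters):
    """Greedy selection (step 5 of the Melting Algorithm): repeatedly
    take the most robust remaining good cluster and discard every
    cluster nested with it -- contained in it, or containing it.
    Each cluster is a (robustness, frozenset_of_member_points) pair."""
    T = list(good_clusters)
    U = []
    while T:
        k = max(range(len(T)), key=lambda i: T[i][0])
        rk, mk = T.pop(k)
        U.append((rk, mk))
        # drop clusters nested with the chosen one (step 5c)
        T = [(r, m) for r, m in T if not (m <= mk or mk <= m)]
    return U
```

Picking a robust parent thus eliminates its sub-clusters and super-clusters in one sweep, so each data point ends up claimed by at most one selected cluster.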
Several remarks about the above algorithm: (1) The algorithm actually consists of two parts: melting and determination of the clusters. (2) The Melting Procedure is governed by the scale parameter $\beta$ only. (3) $\beta_{\max}$ should be chosen such that the number of clusters at $i = 2$ is not significantly less than the number of data points to start with. Otherwise, the initial temperature is too high, which might cause premature partitioning of the data. $\beta_{\max}$ can be obtained easily by simple preprocessing. (4) The determination of the clusters is carried out in steps 2-5. We also note that step 5 can be modified to further study the finer structure of the data, such as clusters within clusters.

²The data were generated from normal distributions. A "cross" denotes the center of the distribution as seen by the computer. A "circle" denotes the representative of a cluster, which is just the arithmetic mean of the data in a given cluster. The horizontal axis indexes scales with fewer than 10 clusters in the Melting Procedure (to avoid too many curves). The same explanations apply to Figure 5.
5 Ability to Handle Clusters of Oriented Ellipsoidal Shapes
Figure 5a shows a data set consisting of four clusters with various orientations and ellipsoidal shapes. Figure 5b shows the trajectories of the clusters. Figure 5c is the plot of the FFEs; it clearly shows that there are four clusters. Figure 5d shows the partition obtained by the algorithm. Note the few data points marked by 0s, which get grouped into clusters different from those generated by the computer. Since they are far away from the originating cluster, such grouping is acceptable. Even without a norm that is biased in the different directions, there is a built-in dynamics in the formulation to handle oriented ellipsoidal shapes with a single $\beta$. This was also demonstrated in the radar application (Wong and Posner 1992). Here is a brief explanation. Obviously, the dynamics of the mapping 3.1 is invariant to rotation of the coordinate system. Suppose, as is reasonable, that each cluster consists of data coming from a source corrupted with unimodal noise. Due to insufficient sampling or pure random fluctuation, the local density is not monotonically decreasing. However, at a coarser scale, it will be monotone. This implies that sooner or later, a cluster will see that its center cannot be balanced due to the monotonicity. It will try to "swim" toward the gradient until balance is reached, which is possible only in the neighborhood around the true signal.
Figure 5: (a) Data illustrating ability to handle many clusters of different shapes and sizes.
Figure 5: (b) y-components of the trajectories of the cluster centers versus scale.
Figure 5: (c) Plots of fractional free energies for the data points in a.
Figure 5: (d) Clustering of the data shown in a. 0, misclassified data points.

6 Computational Complexity and Other Observations
Instead of artificial data, we will illustrate the timing on a real application to the clustering and classification of a 12-dimensional SAR image of an agricultural area (Wong and Posner 1992). Due to the attracting dynamics, the convergence is exponentially fast. It was observed that convergence to a fixed point took an average of 15 iterations at each $\beta$. The exact rate of convergence, however, depends on the Jacobian of the map 3.1, which cannot be known a priori, making it impossible to give an upper bound. For the application, the computation took 26 minutes of SPARC-II cpu time to find the clusters. This is very intensive compared to ISODATA (Ball and Hall 1967), which takes 3 sec provided, however, it is given the right initialization. To compare ISODATA with the Melting Algorithm, we note the key point that if we initialize ISODATA wrongly, it will never find the correct clusters. Here, "wrongly" means putting more than one initial cluster in a real cluster. For the application, we have 1397 data points. There are 13 clusters, each with about 106 data points. Suppose that the initial cluster centers are assigned randomly. There are

$$ \binom{1397}{13} \approx 1.2 \times 10^{31} $$

choices. Of these, only $106^{13} = 2.13 \times 10^{26}$ initializations give the correct partition; even this is an overestimate since some data are noisy. Hence, the probability of a correct initialization is at most $1.82 \times 10^{-5}$. Since ISODATA is 520 times faster than our algorithm, its probability of getting the correct answer is 0.0095 in 26 min, which would have been lower had it not been given the number of clusters. This simple calculation shows that in an obvious sense the Melting Algorithm is at least 105 times (1/0.0095) better than ISODATA. Furthermore, one has to weigh the quality and assurance of the solution obtained by our Melting Algorithm. The current implementation of the Melting Algorithm does not include any heuristics to speed up its computation, though it did use a lookup table of the exponential function. We note the following along these lines:

1. Initially, there are a huge number of clusters. Most clusters will merge quickly since they exist simply because it is too "cold." This effect can be seen in Figure 2. Some preprocessing such as simple grouping would reduce the complexity dramatically.
2. The purpose of the Melting Procedure is to track the trajectories of the clusters in the scale space. Instead of decreasing $\beta$ by a constant factor, we can also utilize numerical techniques such as continuation (Doedel 1986) and adaptive step size selection to track the bifurcation points more accurately and faster.
3. The algorithm is ideal for parallel implementation because of local calculations and cluster independence. In addition, it is possible that some partial a priori information will allow one to perform melting over a range of scales.

Thus, the computational complexity of the algorithm can be improved significantly with the techniques outlined above and other heuristics that we have yet to investigate. Some preliminary work is reported in Tam (1992).

7 Summary
Clustering is a hard problem. The traditional clustering algorithms suffer from several difficulties. The willingness of existing algorithms to partition any set of data suggests that they may more suitably be named "partitioning" algorithms rather than "clustering" algorithms. In this work, we have devised a new clustering algorithm that properly exploits the notion of scale. We also introduced the notion of cluster independence, which has not been formally recognized by prior researchers. It permits the natural application of the maximum entropy principle.³ Cluster centers correspond to the local minima of a thermodynamic free energy. The system is identical to a one-parameter nonlinear map, which can be rigorously analyzed using bifurcation techniques. Melting the system produces a tree of clusters in the scale space. Melting can also account for variabilities in cluster densities, sizes, and shapes (ellipsoidal). We further tested this algorithm on the clustering and classification of a Synthetic Aperture Radar image of an agricultural site with 12 attributes. Since clustering is a form of unsupervised learning, we expect this work should provide some new insights for neural network research and optimization theory, too, but we will not discuss that here.

³For related results on clustering using the maximum entropy principle, see Rose et al. (1990) and the work by J. Buhmann and H. Kühnel in this issue.

Acknowledgments

I am deeply indebted to my thesis advisor Edward C. Posner for his patience, encouragement, and advice. I benefited much from him. I thank Professor Steve Wiggins at Caltech for clarifying some concepts on the local bifurcations. This work is supported by Pacific Bell through a grant to the California Institute of Technology and by NASA through the Caltech Jet Propulsion Laboratory, as well as a Charles Lee Powell Foundation Graduate Fellowship at Caltech.

References

Ball, G., and Hall, D. 1967. A clustering technique for summarizing multivariate data. Behav. Sci. 12, 153-155.
Bezdek, J. C. 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, New York.
Cheeseman, P., et al. 1988. Autoclass: A Bayesian classification system. Proceedings of the 1988 Machine Learning Workshop.
Doedel, E. 1986. AUTO: Software for Continuation and Bifurcation Problems in Ordinary Differential Equations. Tech. Rep., Applied Mathematics, Caltech.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. Wiley, New York.
Gabor, D. 1946. Theory of communication. J. IEE 93, 429-457.
Gath, I., and Geva, A. B. 1989. Unsupervised optimal fuzzy clustering. IEEE Trans. Pattern Anal. Machine Intell. PAMI-11, 773-781.
Jain, A. K., and Dubes, R. C. 1988. Algorithms for Clustering Data.
Prentice Hall, Englewood Cliffs, NJ. Jaynes, E. T. 1957. Information theory and statistical mechanics I. Phy. Rev. 106, 620-630. Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. 1983. Optimizationby simulated annealing. Science 220,671480. Koenderink, J. J. 1984. The structure of images. Biol. Cybern. 50,363-370. Linde, Y., Buzo, A., and Gray, R. M. 1980. An algorithm for vector quantization. IEEE Trans. Commun. COM-28, 84-95.
Yiu-fai Wong
Received 16 January 1992; accepted 12 May 1992.
Communicated by John Platt
Coarse Coding Resource-Allocating Network

Gustavo Deco
Jürgen Ebmeyer
Siemens AG, SI AT 3, Landshuter Strasse 26, 8044 München-Unterschleissheim, München, Germany

In recent years localized receptive fields have been the subject of intensive research, due to their learning speed and efficient reconstruction of hypersurfaces. A very efficient implementation of such a network was proposed recently by Platt (1991). This resource-allocating network (RAN) allocates a new neuron whenever an unknown pattern is presented at its input layer. In this paper we introduce a new network architecture and learning paradigm. The aim of our approach is to incorporate "coarse coding" into the resource-allocating network. The network presented here provides for each input coordinate a separate layer, which consists of one-dimensional, locally tuned gaussian neurons. In the following layer multidimensional receptive fields are built by using pi-neurons. Linear neurons aggregate the outputs of the pi-neurons in order to approximate the required input-output mapping. The learning process follows the ideas of the resource-allocating network of Platt, but due to the extended architecture of our network other improvements of the learning process had to be defined. Compared to the resource-allocating network, a more compact network with comparable accuracy is obtained.

1 Introduction
Several authors have analyzed localized receptive fields for solving the problem of hypersurface reconstruction. Layered neural networks with locally tuned units (Moody and Darken 1989) or hyper-basis functions (Poggio and Girosi 1990a,b) have been formulated and successfully applied. The use of such gaussian neurons was proposed due to their relation to the theory of approximation. Lapedes and Farber (1987) suggested further that multilayer networks with sigmoidal units are effective because they build gaussian-like bumps, which perform the approximation of the hypersurface. The use of gaussian neurons has brought improved accuracy and faster learning. The traditional pattern recognition methods, such as k-nearest neighbors or Parzen windows, suffer from the same disadvantage as localized receptive field networks: they allocate too many units

Neural Computation 5, 105-114 (1993) © 1993 Massachusetts Institute of Technology
during the learning phase. This may be problematic for the network's capability to generalize, as pointed out by Platt (1991). Recently, a very ingenious architecture and learning paradigm, called a resource-allocating network (RAN), was proposed by Platt (1991). The basic idea of this network is the construction of an architecture capable of adjusting efficiently the number of units to obtain a compact and very accurate network. Poggio (1990) has pointed out that radial basis functions are factorizable and that their synthesis is much easier when factorized. This method is called "coarse coding." Poggio suggested that this model could explain how 3D objects are learned by the cortical cells involved in visual face recognition. He also discussed the relation of this model to theories of the cerebellum and motor control (Poggio 1990). The aim of the present work is to formulate a RAN-like self-building architecture that uses coarse coding to improve the accuracy and reduce the complexity of the network. We call this a coarse coding resource-allocating network, henceforth referred to as CC-RAN. In the first section of the paper the architecture of this network is described, and in the second section the learning algorithm is presented. This network was implemented in C++ on a Unix workstation and tested with the standard benchmark Mackey-Glass chaotic time series. Results and discussion are given in the final section.

2 Architecture of the Network
In this section the architecture of CC-RAN is described. To simplify the theoretical description, we discuss in this paper the network architecture for approximation of functions from R^n to R. The generalization to functions from R^n to R^m follows trivially. Figure 1 shows the graphic representation of CC-RAN. Each input variable x_k is connected with a separate layer RF_k of one-dimensional gaussian receptive fields. The second layer consists of pi-neurons (Rumelhart et al. 1986) that synthesize the factorized receptive fields by selecting one local unit per RF layer and multiplying them together. As output layer, one linear neuron performs the linear combination of the multidimensional functions. The neurons of the first layers RF_k (receptive field k) are locally tuned units. The neuron m of layer RF_k is active only when the coordinate x_k of the presented input x is in the neighborhood of the center c_m^k. We implemented these neurons following Platt (1991) and Moody and Darken (1989), but now as one-dimensional gaussian fields,

o_m^k(x_k) = exp[ -(x_k - c_m^k)^2 / (σ_m^k)^2 ]    (2.1)

where σ_m^k is the width and c_m^k the center of the gaussian function.
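As an illustration, the response of a single one-dimensional locally tuned unit can be sketched as follows (a minimal sketch; the function name and the example values are ours, not the paper's):

```python
import math

def gaussian_unit(x_k, center, width):
    """One-dimensional locally tuned (gaussian) unit: it responds strongly
    only when the input coordinate x_k lies near the unit's center."""
    return math.exp(-((x_k - center) ** 2) / (width ** 2))

# The response peaks at 1.0 when the input hits the center exactly
# and decays toward 0 as the input moves away from it.
```
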
Figure 1: Graphic representation of the coarse coding resource-allocating network introduced in this paper. Each input coordinate is connected with one-dimensional locally tuned neurons in a separate RF layer. The synthesis of the one-dimensional receptive fields is performed by pi-neurons. The height and combination of the multidimensional gaussian functions are calculated in the linear output neuron to achieve the best approximation to the given hypersurface.
The pi-neurons of the second layer perform the following operation (see also Durbin and Rumelhart 1989),

p_i = ∏_k o_{π(k,i)}^k,   k = 1, ..., n (dimension of the input vector)    (2.2)

The k-index extends over all RF layers. In equation 2.2, p_i is the output of pi-neuron i. The connection of pi-neuron i with the layer RF_k [indicated by π(k,i)] is determined during the learning process. The
linear output neuron evaluates the outputs of the pi-neurons by

y = Σ_i W_i p_i,   i = 0, ..., number of pi-units    (2.3)

The bias W_0 gives the output when all pi-units are deactivated. The W_i are the synapses between the pi-neurons of the second layer and the output neuron; these weights represent the heights of the synthesized gaussian functions. This structure will be constructed and adjusted optimally by the learning algorithm, described in the next section. The coarse coding was introduced by using a layer of one-dimensional, locally tuned units for each coordinate of the input vector. It is interesting to remark that this network satisfies the Stone-Weierstrass theorem. This means that this network is a universal approximator for real-valued maps defined on convex, compact sets of R^n (see Hartman et al. 1990). We define the set of functions realized by the network,

N = { f : Q → R | f(x) = Σ_i W_i ∏_k o_{π(k,i)}^k(x_k) }

It is easy to prove that N is an algebra of real continuous functions on a compact set Q, that it separates the points of Q, and that it does not vanish at any point of Q; these are the preconditions of the Stone-Weierstrass theorem. So the uniform closure of N contains all real-valued continuous functions on Q.

3 Learning Paradigm for CC-RAN
We follow the idea of Platt (1991) for the implementation of the allocation and learning rules for CC-RAN. The network generation process starts initially without any neuron. Each layer of one-dimensional receptive fields (RF) has an allocator. This allocator creates a new locally tuned neuron in layer RF_k if two novelty conditions are satisfied. The first condition considers a pattern as new if the minimal distance of the coordinate x_k relative to the centers c_m^k is greater than a determined barrier d, i.e.,

min_m | x_k - c_m^k | > d    (3.1)

The second condition is given by

|| T^p - O^p || > ε    (3.2)

where T^p is the target and O^p the actual output for pattern p, which means that new neurons will not be created for smaller corrections.
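The two novelty conditions for one RF layer can be sketched as follows (an illustrative sketch under our own naming; `error_norm` stands for the output error magnitude of the second condition):

```python
def is_novel(x_k, centers, d, error_norm, eps):
    """RAN-style novelty test for one RF layer (a sketch; names are ours).
    Condition 1: the coordinate is far from every existing center.
    Condition 2: the current output error is too large for mere
    gradient correction (no new units are created for small errors)."""
    if not centers:          # an empty layer treats every input as novel
        far_enough = True
    else:
        far_enough = min(abs(x_k - c) for c in centers) > d
    return far_enough and error_norm > eps
```

A new one-dimensional gaussian unit is allocated only when both conditions hold; otherwise the pattern is handled by the gradient correction described below.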
The barrier d(t) shrinks each training epoch following the exponential decay,

d(t) = max( d_min, e^{-t/τ} d_max )    (3.3)
In equation 3.3, τ is the decay factor. The width and the center of a created neuron are given by

c_new^k = x_k    (3.4)

σ_new^k = κ min_m | x_k - c_m^k |    (3.5)

The second layer of pi-neurons has its own allocator of pi-neurons. This allocator creates a new pi-neuron in two different cases. The first case requires the satisfaction of two conditions: a new pi-neuron is created if at least one of the RF layers has created a new one-dimensional gaussian unit, and if equation 3.2 is satisfied. The newly created pi-neuron will be connected to the new one-dimensional gaussian units. In the other RF layers, where no new gaussian unit has been created, the gaussian unit with the center nearest to the respective input will be connected to the new pi-neuron. The second case for creating a pi-neuron occurs if no new gaussian neurons are created and no pi-neuron exists that connects the gaussian neurons in such a way that the distance between the input and the center of the multidimensional receptive field (built by all RF layers) is smaller than d(t). In this case the new pi-neuron will connect the neurons of the different RF layers with the centers nearest to the input. The new connection, W_i, of the newly created pi-neuron with the output layer is given by

W_new = T^p - O^p    (3.6)
After the coarse representation of the function, new units with smaller widths are successively inserted until the network has learned the examples with the desired accuracy and resolution. This resolution is essentially given by the minimal value of d(t). In the case that neither gaussian neurons nor pi-neurons are created, the traditional gradient method is used for the smaller corrections. The correction equations are

ΔW_i = α (T^p - O^p) p_i    (3.8)

Δc_i^k = 2α (T^p - O^p) [(x_k - c_i^k)/(σ_i^k)^2] Σ_{i': π(k,i')=i} W_{i'} p_{i'}    (3.9)
In equation 3.9 the summation is extended over all i' such that π(k, i') = i, in other words, all pi-units that are connected with the gaussian unit i of layer k. As pointed out by Platt, the two novelty conditions are necessary for creating a compact network. In this way new neurons represent exactly the novel patterns, and the small aberrations are corrected using the gradient method. Due to the use of pi-neurons and the independent allocation of RF neurons at each input dimension, CC-RAN forms more compact networks than RAN does, because the storage of each dimension happens separately, so that one particular RF center can be used in many pi-neurons without having to be stored again in a new neuron every time.

4 Results
The standard test case for regularization networks is the prediction of the chaotic time series defined by Mackey and Glass (1977). The delay difference equation of Mackey-Glass can be expressed as

x(t + 1) = (1 - b) x(t) + a x(t - τ) / [1 + x^10(t - τ)]    (4.1)
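Equation 4.1 is easy to iterate directly; the following sketch generates the series (the function name and the constant initial history x0 are our choices, not the paper's):

```python
def mackey_glass(n, a=0.2, b=0.1, tau=17, x0=1.2):
    """Iterate the discretized Mackey-Glass delay equation
    x(t+1) = (1-b)*x(t) + a*x(t-tau)/(1 + x(t-tau)**10).
    The history before t = 0 is held at x0 (a common, if arbitrary, choice)."""
    x = [x0] * (tau + 1)              # seed values x(-tau), ..., x(0)
    for t in range(tau, tau + n):
        x_del = x[t - tau]            # the delayed term x(t - tau)
        x.append((1 - b) * x[t] + a * x_del / (1 + x_del ** 10))
    return x[tau + 1:]                # the n newly generated points
```
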
We tested CC-RAN with a = 0.2, b = 0.1, and τ = 17. To compare with other neural and nonneural models, the training set contained four points, x(t), x(t - 6), x(t - 12), x(t - 18), and the network had to predict the output x(t + 85). The parameters used for CC-RAN are α = 0.02, τ = 25, d_max = 0.7, d_min = 0.07 or 0.03, ε = 0.05, κ = 0.87, and 400 learning epochs. The test set consists of 500 points of the output of the Mackey-Glass equation at t = 4000. Figure 2 shows the normalized error (rms error divided by the square root of the variance of the output of equation 4.1) as a function of the training set size. A normalized error equal to one indicates that the prediction is not better than the mean value. For comparison, the curves of other approximation methods are also given. A better generalization capability can be observed when a training set with 100 points is used. The measurement of the complexity of the network is shown in Figure 3. This figure relates the number of weights (heights, widths, and centers of the gaussian functions) of the networks to the size of the training set. If the size of the training set grows, the increase in complexity is very small for the RAN and CC-RAN networks. Figure 4 shows the normalized error as a function of the number of weights. In this figure it can be seen that CC-RAN is a remarkably compact network (small number of weights), achieving a similar accuracy to the other ones. Even in those cases where CC-RAN (d_min = 0.03) performs more accurately than RAN, its complexity is still smaller than that of RAN. The essential fact that contributes to this compactness is the use of the same one-dimensional receptive field by many different pi-neurons. The improvement of the independent creation of RF layers lies
in the fact that they permit us to learn and adapt compactly and efficiently each component of a multidimensional input that can be strongly asymmetric and/or of different range and dispersion. This was the reason for the implementation of coarse coding in a resource-allocating network.

Figure 2: Normalized error as a function of the size of the training set. (•) CC-RAN (d_min = 0.07); (∗) CC-RAN (d_min = 0.03); (○) RAN (ε = 0.05, of Platt 1991); (⋆) Backpropagation; (△) hashing B-spline; (□) standard RBF; (■) K-means RBF. All results of the other approximations are taken from Platt (1991).

5 Conclusions
A new architecture and learning algorithm, called the coarse coding resource-allocating network, is proposed for a self-building neural network. The network learns by allocating new neurons to the receptive field layers as well as to the pi-neuron layer, if determined novelty conditions are
Figure 3: Number of weights of the networks or algorithms as a function of the size of the training set. (•) CC-RAN (d_min = 0.07); (∗) CC-RAN (d_min = 0.03); (○) RAN (ε = 0.05, of Platt 1991); (⋆) Backpropagation; (△) hashing B-spline; (□) standard RBF; (■) K-means RBF. All results of the other approximations are taken from Platt (1991).
satisfied (i.e., if substantial differences from already learned patterns are found). If only small aberrations between expected and actual output are observed, only small adjustments are made by applying the gradient correction method to the centers of each one-dimensional receptive field and to the height of the multidimensional gaussian function synthesized by the pi-neurons. More compact networks are obtained by the separate allocation of RF centers for each dimension, which may be used in many pi-neurons without having to be stored again in a new neuron every time. The compactness achieved by the final network in learning the chaotic series of Mackey-Glass was important for improved generalization.
Figure 4: Normalized error as a function of the number of weights of the networks or algorithms. (•) CC-RAN (d_min = 0.07); (∗) CC-RAN (d_min = 0.03); (○) RAN (ε = 0.05, of Platt 1991); (⋆) Backpropagation; (△) hashing B-spline; (□) standard RBF; (■) K-means RBF. All results of the other approximations are taken from Platt (1991).

Acknowledgments

We gratefully acknowledge helpful comments from the editor and J. Cuellar.
References Durbin, R., and Rumelhart, D. 1989. Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Comp. 1, 133-142.
Hartman, E., Keeler, J., and Kowalski, J. 1990. Layered neural networks with gaussian hidden units as universal approximators. Neural Comp. 2, 210-215.
Lapedes, A., and Farber, R. 1987. Nonlinear signal processing using neural networks: Prediction and system modeling. Tech. Rep. LA-UR-87-2662, Los Alamos National Laboratory, Los Alamos, NM.
Mackey, M., and Glass, L. 1977. Oscillation and chaos in physiological control systems. Science 197, 287.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Platt, J. 1991. A resource-allocating network for function interpolation. Neural Comp. 3, 213-225.
Poggio, T. 1990. A theory of how the brain might work. A.I. Memo No. 1253, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Poggio, T., and Girosi, F. 1990a. Networks for approximation and learning. Proc. IEEE 78(9), 1481-1497.
Poggio, T., and Girosi, F. 1990b. A theory of networks for approximation and learning. A.I. Memo No. 1140, Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.

Received 29 January 1992; accepted 27 April 1992.
Communicated by Michael Jordan
Training Periodic Sequences Using Fourier Series Error Criterion

James A. Kottas*
MIT EECS Department, Massachusetts Institute of Technology, Cambridge, MA 02139 USA

Training a network to learn a set of periodic input/output sequences effectively makes the network learn a mapping between amplitudes and phases in Fourier space. The spectral backpropagation (SBP) training algorithm is a different way of doing this task. It measures the Fourier series components of the output error sequences and minimizes the total spectral energy as an adaptation criterion. This approach can train not only the weights but also time delays associated with the interconnects. Furthermore, the cells can have finite bandwidth via a first-order low-pass filter. Having adaptable time delays gives the SBP algorithm a powerful way to control the phase characteristics of the network.

1 Introduction

In many areas such as adaptive control, speech, and signal processing, the signals of interest are functions of time rather than vectors to be classified. Many researchers have investigated processing temporal sequences using neural networks (for example, see Jordan 1989; Pearlmutter 1989; and Pineda 1989). Consider the special case when the sequences are periodic. This can occur, for example, when trying to recognize and distinguish between the characteristic limit cycles produced by a neural model of the olfactory bulb (Yao and Freeman 1990). One way to do this task is to use a time-delay neural network (Waibel et al. 1989). This type of network essentially converts a temporal sequence into a vector and then uses a parallel mapping network to perform any recognition or classification. The network can recognize a sequence after one period and is not constrained to use periodic sequences. However, the serial input buffer (the time-delay part) must be as long as the minimum sequence period. For long sequences, the buffer could become quite large.
Another way to solve this type of problem is to use a network for performing amplitude and phase processing on the sequence to transform it

*Current address: Symbus Technology, 1330 Beacon Street, Suite 249, Brookline, MA 02146 USA.
Neural Computation 5, 115-131 (1993) © 1993 Massachusetts Institute of Technology
into a form that can be recognized more easily. This approach involves training a network to learn a set of periodic input/output sequences. Although algorithms such as Pearlmutter's (1989) continuous-time formulation of the conventional backpropagation equations (Rumelhart et al. 1986) and Pineda's (1989) recurrent backpropagation algorithm could be used here, the technique described herein, called the spectral backpropagation (SBP) algorithm, offers definite advantages. The SBP algorithm uses the Fourier series spectrum of the desired output sequences to form a spectral error criterion that is then backpropagated through the network. Essentially, the SBP algorithm is an extension of the conventional recurrent backpropagation algorithm for descending an error surface in the Fourier domain. Although more computationally intensive, the SBP algorithm allows not only the weights but also the time delays associated with the interconnects to be trained. This ability allows the training process direct access to the phase components of the sequence relationships to be learned. Furthermore, the cells in the network can have finite bandwidth via a first-order low-pass filter. When the cells have nonlinear output functions, the resulting spreading of spectral energy can be handled without difficulty. However, having adaptable time delays can permit simple linear networks to be used for processing temporal sequences rather than more complex nonlinear networks. In this paper, the basic structure of the SBP training algorithm is described along with representative simulation results. Throughout the discussion, continuous-time functions are denoted by f(t) and discrete-time functions by f[t].

2 Spectral Backpropagation Training Algorithm
Consider a network of cells that are described by the continuous-time equations,

u_i(t) = x_i(t) + Σ_j w_ij y_j(t - T_ij)    (2.1)

τ_v dv_i(t)/dt = u_i(t) - v_i(t)    (2.2)

y_i(t) = S_i(v_i(t))    (2.3)

where u_i(t) is the weighted input sum for the ith cell at time t, x_i(t) is the external input, v_i(t) is the filtered cell input with time constant τ_v, and y_i(t) is the cell output generated from the filtered input via the function S_i. Each interconnect to the ith cell from cell j has weight w_ij and time delay T_ij. Assuming y_i(t) is smooth, periodic, and "slowly varying," it can be approximated by the truncated Fourier series,

y_i(t) ≈ Σ_{k=0}^{K} [ Y^c_ik cos(kω_0 t) + Y^s_ik sin(kω_0 t) ]    (2.4)
where ω_0 = 2π/T_0, with T_0 being the fundamental period of the oscillation and K the highest harmonic to consider. The coefficients Y^c_ik and Y^s_ik represent the real-valued Fourier series amplitudes for the cosine and sine components, respectively. They can be computed using the transform,

Y^c_ik = (β_k / T_0) ∫_0^{T_0} y_i(t) cos(kω_0 t) dt    (2.5)

Y^s_ik = (β_k / T_0) ∫_0^{T_0} y_i(t) sin(kω_0 t) dt    (2.6)

where

β_k = 1 for k = 0;  β_k = 2 for k > 0    (2.7)
2 fork > 0 Using the vector notation yik
=
[t]
(2.8)
the cell equations can be transformed into the Fourier domain, resulting in CoS(k~0Tij) - sin(kw0Tij) ’ yjk (2.9) ujk = x i k -k wjj sin(kwoTij) cos(kw0Tjj) i
[
vik
=
I
1
(2.10)
(2.11)
For linear cells with yi(t) = mivi(t)
(2.12)
where mi is a gain constant, the spectral cell output is simply Yjk
= miVjk
(2.13)
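The quadrature phase matrix of equation 2.9 is just a plane rotation of the (cosine, sine) coefficient pair by the phase angle kω_0T_ij. A small sketch (the function names are ours):

```python
import math

def delay_rotation(k, w0, T):
    """Quadrature phase matrix that a time delay T becomes in the Fourier
    domain at harmonic k (equation 2.9): a rotation by the angle k*w0*T."""
    c, s = math.cos(k * w0 * T), math.sin(k * w0 * T)
    return [[c, -s], [s, c]]

def apply_delay(k, w0, T, yc, ys):
    """Rotate the (Yc, Ys) coefficient pair of a delayed signal."""
    m = delay_rotation(k, w0, T)
    return m[0][0] * yc + m[0][1] * ys, m[1][0] * yc + m[1][1] * ys

# Delaying a pure cosine by a quarter period turns it into a sine:
# the coefficient pair (1, 0) rotates to (0, 1).
```
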
Note that the transformation into the Fourier domain causes the time delays T_ij to become a simple quadrature phase matrix. Similarly, the cell time constant τ_v becomes an amplitude scaling factor that depends on the spectral frequency component kω_0. The spectral error criterion, which is the basis for the SBP algorithm, is obtained by comparing the actual output sequence y_i(t) to the desired
output sequence y_i^d(t). In the time domain, the error as a function of time is

e_i(t) = y_i^d(t) - y_i(t)    (2.14)

where i refers here only to a network output cell. The total error for the current output sequence over all output cells (N_o) is given by

E = Σ_{i=1}^{N_o} (1/T_0) ∫_0^{T_0} e_i^2(t) dt    (2.15)

In the same way, the total error over all output sequences is simply the sum of E for each sequence. Since this is a linear operation, the derivation below will be done as if there were only one desired output sequence. The results are then summed over all output sequences to obtain the complete error criterion. Using Parseval's theorem, the error in equation 2.15 can be approximated by

E ≈ Σ_{i=1}^{N_o} Σ_{k=0}^{K} [ (E^c_ik)^2 + (E^s_ik)^2 ] / β_k    (2.16)

where E^c_ik and E^s_ik are the Fourier series coefficients of the temporal error sequence defined in equation 2.14. As in the conventional backpropagation algorithm, the weights and time delays are adapted according to the gradient descent driving term,

Δz_ij = -(1/τ_a) ∂E/∂z_ij    (2.17)

where z_ij can be either a weight w_ij or a time delay T_ij, and τ_a(z) is the corresponding adaptation time constant with units of numbers of training epochs. Using an approach analogous to that given for the conventional backpropagation algorithm in Rumelhart et al. (1986), expressions for Δw_ij and ΔT_ij can be derived. For output cells, the spectral cell errors for the ith cell at the kth frequency index (k = 0, 1, 2, ..., K) can be defined as

δ_ik = -[ ∂E/∂V^c_ik, ∂E/∂V^s_ik ]^T    (2.18)

allowing the driving term to be written as

Δz_ij = (1/τ_a) Σ_k δ_ik^T (∂V_ik/∂z_ij)    (2.19)

using the chain rule. Note that the spectral cell errors δ_ik are independent of z_ij, the weight or time delay being adapted.
The term that depends on z_ij, ∂V_ik/∂z_ij = ∇_{z_ij}V_ik, can be derived from the spectral cell equations 2.9 and 2.10. When z_ij is the weight w_ij, ∂V_ik/∂w_ij = ∇_{w_ij}V_ik is given by

∂V_ik/∂w_ij = [ A_1k(T_ij)  -A_2k(T_ij) ; A_2k(T_ij)  A_1k(T_ij) ] Y_jk    (2.20)

where the coefficients A_1k(T_ij) and A_2k(T_ij), defined in equations 2.21 and 2.22, incorporate the effects of both the cell filter and the interconnect time delays. When z_ij is the time delay T_ij, ∂V_ik/∂T_ij = ∇_{T_ij}V_ik follows from differentiating A_1k(T_ij) and A_2k(T_ij) with respect to T_ij (equations 2.23-2.25), giving

∂V_ik/∂T_ij = -w_ij kω_0 [ A_2k(T_ij)  A_1k(T_ij) ; -A_1k(T_ij)  A_2k(T_ij) ] Y_jk    (2.26)
Given the expressions for ∂V_ik/∂z_ij, only the spectral cell errors need to be determined in order to compute the adaptation driving term Δz_ij. For an output cell, the spectral cell error follows equation 2.27 in general. If the output cell is linear, so y_i(t) = m_i v_i(t), the spectral cell error simplifies to

δ_ik = m_i E_ik    (2.28)

where E_ik is the set of Fourier series components of the error sequence defined in equation 2.14. However, if the output cell has a nonlinear output function S_i, the spectral components of v_i(t) are spread across the frequency spectrum. This effect is captured by the matrix term of equation 2.29, whose components can be calculated using equations 2.30-2.33, where S'_i(v) = dS_i(v)/dv (equation 2.34). Clearly, the nonlinear case involves much more computation than when the cells are linear.
For a hidden cell (one whose output is not an output of the network), the spectral cell errors can be computed using the recursive backpropagation relationship of equation 2.35. When the hidden cell is linear, the backpropagation expression can be simplified to equation 2.36.
Using these expressions, the adaptation driving terms Δw_ij and ΔT_ij can be computed for each adaptable interconnect.

3 The Training Process
A training epoch consists of the set of desired training sequences, each repeated n_e times, in succession. An example epoch for two training cycles is shown in Figure 1. At the beginning of an epoch, the error gradients Δz_ij(t) are initialized to 0. The first training sequence, having period T_1 (so T_0 = T_1), is presented for n_e - 1 cycles to allow transients to decay away. During the n_e-th cycle, the output error E_1 for the first sequence is computed along with the spectral coefficients of the output and error sequences. At the end of this cycle, the gradient information Δz_ij(t) is computed and saved. Next, the second training sequence with period T_2 (so T_0 now is set to T_2) is applied to the network. After n_e - 1 transient cycles, the output error E_2 and a new set of spectral coefficients are measured during the n_e-th cycle. The gradient information for this sequence is calculated and added to the previous Δz_ij(t) from the first training sequence. If there were a third training sequence, the above procedure used for the second sequence would be applied again using this third sequence. After all training sequences have been cycled n_e times and the resulting gradient terms Δz_ij(t) accumulated, the weights and/or time delays are updated and the total training error is found by summing the individual output errors, Σ_i E_i. Alternatively, the weights and/or time delays can be adapted immediately after each training sequence. In this case, Δz_ij(t) is not integrated over all sequences but used after each sequence. While this method is not a true gradient descent, it usually converges faster in practice than when the interconnect updates are done only at the end of the epoch.
James A. Kottas
Figure 1: An example epoch with two training sequences. Note that each sequence is cycled n_c = 6 times to allow transients to decay.

4 Training Considerations

4.1 Time Delay Wrap-Around. If there is only one training sequence with period T1, or if all the training sequences have the same period (T1 = T2 = ... = TN), Tij can wrap around the period if the gradient information in ΔTij(t) tries to make Tij negative. Then, if Tij does become negative, it can be changed to Tij + T0 (where T0 = T1).
4.2 Zero Weights. During the training process, it is quite possible that a weight may have to change sign. However, such a transition can cause the weight to become zero and stay there for the remainder of the adaptation. This condition can introduce an undesirable local minimum into the error surface by effectively reducing the number of degrees of freedom in the network. The interconnects at risk are those associated with the outputs of the hidden cells. From equations 2.35 and 2.36, the hidden cell errors δj are proportional to wij. As wij → 0, δj also approaches 0, causing the driving term Δwij(t) to become very small. The solution wij = 0 is the trivial solution to the adaptation, since δj = Δwij(t) = 0 in this case. A further consequence is that the backpropagation process effectively stops because the cell error is zero. The solution to the problem is to monitor wij and Δwij. When wij is within ε0 of 0 in magnitude and the sign of Δwij is such that wij will be adapted toward 0, it is necessary to force wij to "jump" over 0. Assuming ε0 > 0, this can be done using the following algorithm: If 0 ≤ wij ≤ ε0 and Δwij < 0, set wij = −ε0. If −ε0 ≤ wij ≤ 0 and Δwij > 0, set wij = ε0.
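The jump rule can be written compactly. The sketch below is a direct transcription of the algorithm above, with eps0 standing for the dead-zone half-width ε0:

```python
def jump_zero(w, dw, eps0=1e-3):
    """Force a weight to 'jump' over 0 when its adaptation step would
    otherwise strand it there (where the hidden-cell error, and hence the
    driving term, vanishes).  dw is the current driving term Delta-w_ij."""
    if 0.0 <= w <= eps0 and dw < 0.0:
        return -eps0      # adapted toward 0 from above: jump below
    if -eps0 <= w <= 0.0 and dw > 0.0:
        return eps0       # adapted toward 0 from below: jump above
    return w              # otherwise leave the weight unchanged
```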
4.3 Finite Cell Bandwidth. The overall effect of having cells with finite bandwidth is to limit the spectral content of the input/output training sequences. Nonlinear networks have a mechanism for counteracting the attenuation that is unavailable in linear networks: nonlinear sigmoidal output functions Si can mix the spectral energy in different frequencies via the driving terms given in equations 2.29-2.33. For frequencies much greater than the cutoff frequency fc = 1/(2πτ), where τ is the cell time constant, the attenuation is too great for this coupling mechanism to overcome. However, for frequencies around fc, a nonlinear network can learn to compensate for the attenuation via the SBP training algorithm. Since there is no corresponding mechanism in a linear network, it cannot be expected to learn a set of input/output sequences that contain frequencies around or above fc. For reliable learning to occur, the bandwidth of the output sequences must be less than fc, preferably by at least half a decade (a factor of about 3.16). Another effect of a positive time constant τ is that a phase delay is introduced between the input and output signals of a cell. This delay can be thought of as a form of time delay for propagating signals through the cell. Its influence in the training process is maintained by the coefficients defined in equations 2.21 and 2.22. The amount of the effective delay cannot be controlled because it is determined by τ and the frequency content of the input sequence. However, the interconnect time delays can be used to compensate accordingly. Multiple interconnect time delays must be adapted, though, because the amount of compensation available from any single interconnect is limited. The limit is imposed not by the interconnects but by the spectral content of the input signal: the propagation delay varies between frequencies of the input signal. This situation reveals another benefit of having adaptable time delays in the interconnects.
5 Discrete-Time Considerations
In the continuous-time formulation, the time delays are continuous. However, in a discrete-time simulation of a network, the delays must become
discrete in some form because of storage limitations. To approximate continuous delays, the function value at a nonintegral delay can be interpolated from neighboring function values at integral delays. For example, suppose Tij = 5.3 time steps and yj[t − Tij] is needed (where t is the current time step). Using linear interpolation,

yj[t − 5.3] ≈ (0.7)yj[t − 5] + (0.3)yj[t − 6]   (5.1)
For a general linear interpolation function, Tij can be broken down into an integral part and a fractional part,

Tij = ⌊Tij⌋ + δTij   (5.2)

where ⌊Tij⌋ is the largest integer less than or equal to Tij (i.e., the truncation function) and δTij is the fractional offset (0 ≤ δTij < 1). The form for the general linear interpolator is

yj[t − Tij] ≈ (1 − δTij) yj[t − ⌊Tij⌋] + δTij yj[t − ⌊Tij⌋ − 1]   (5.3)
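In code, the general interpolator reads as follows (a minimal sketch; `y` is assumed to be a buffer of past cell outputs indexed by time step):

```python
import math

def delayed_output(y, t, T):
    """Linearly interpolated y[t - T] for a fractional delay T:
    y[t - T] ~ (1 - dT) * y[t - floor(T)] + dT * y[t - floor(T) - 1]."""
    n = math.floor(T)     # integral part of the delay
    d = T - n             # fractional offset, 0 <= d < 1
    return (1.0 - d) * y[t - n] + d * y[t - n - 1]
```

For T = 5.3 this reproduces the example of equation 5.1: 0.7·yj[t − 5] + 0.3·yj[t − 6].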
The discrete-time equations for the continuous-time cells governed by equations 2.1-2.3 are given by equations 5.4-5.6, where ai = e^(−1/τi) and yj[t − Tij] is computed using the above approximation. Note that if all the time delays are 0, each interconnect effectively has a minimum delay of 1 time step. In the remaining discussions, the time delays Tij will refer to the amount of additional time delay in the interconnects. The change in yi(t) over time can be approximated using the backward difference formula

dyi(t)/dt ≈ Δyi[t] = yi[t] − yi[t − 1]   (5.7)

The corresponding Fourier series coefficients for yi[t] are approximated by equations 5.8 and 5.9, in which there is an implied Δt = 1 in the summation. Note that the approximations in these formulas go beyond the integral-to-sum conversion. Depending on T0 and K, the sine/cosine basis set of functions
may not be approximately orthogonal in discrete time because of too few samples in the waveforms. The sampling frequency (one sample per time step) should be larger than the Nyquist rate of the highest harmonic, 2K/T0, where T0 is the duration of the shortest training sequence; a factor of 5 above this rate is sufficient in practice. The changes in the weights and time delays, governed by equation 2.17, are implemented in discrete time using the smoothed update

Δz̄ij[t] = γ Δz̄ij[t − 1] + (1 − γ) Δzij[t]   (5.10)

where γ = e^(−1/τη).
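The discrete coefficient approximation (the role played by equations 5.8 and 5.9) can be sketched with the standard discrete Fourier series sums. The exact normalization used in the paper is not reprinted here, so the (2/T0) convention below is an assumption:

```python
import math

def fourier_coeffs(y, T0, K):
    """Approximate Fourier series coefficients of one steady-state period
    y[0..T0-1], with an implied delta-t = 1 in the sums:
    a[k] = (2/T0) * sum_t y[t] * cos(2*pi*k*t/T0), b[k] likewise with sin.
    a[0] is halved so that it equals the mean of y."""
    a, b = [0.0] * (K + 1), [0.0] * (K + 1)
    for k in range(K + 1):
        for t in range(T0):
            w = 2.0 * math.pi * k * t / T0
            a[k] += y[t] * math.cos(w)
            b[k] += y[t] * math.sin(w)
        a[k] *= 2.0 / T0
        b[k] *= 2.0 / T0
    a[0] /= 2.0           # DC term
    return a, b
```

For a pure sinusoid such as the input of equation 7.1, these sums recover b[1] ≈ 1 and leave all other coefficients near 0, provided T0/K stays well above the 2-samples-per-period Nyquist minimum.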
6 Choosing Parameters
There are two parameters that must be chosen a priori: K, the maximum spectral frequency, and n_c, the number of cycles per sequence in an epoch. The criteria for choosing K are varied and depend upon the available computation power as well as the duration of the shortest training sequence. The argument for making K large is to capture the details of the shapes of all the training sequences. However, larger values of K require more computation, especially if nonlinear cells are used, so that the nonlinear coupling terms in equations 2.30-2.33 must be calculated. Furthermore, when training with short sequences, K cannot be too large or else the sine/cosine set of basis functions in the Fourier series decomposition ceases to be orthogonal. The net effect is akin to aliasing, whereby the spectral coefficients become inaccurate, usually too big. A reasonable guideline is to have at least 10 samples per period of the sine and cosine functions of the highest harmonic. Therefore, if T0 is the duration of the shortest training sequence, K should be at most T0/10. The criterion for choosing n_c is based on making sure the network is at a steady state (limit cycle or fixed point) in order for the Fourier analysis to be valid. Therefore, the number of transient cycles per sequence in an epoch, n_c − 1, must be large enough for all transients to have decayed away. However, as n_c increases, the training run time will lengthen because of the increased computational load. The best way to choose n_c is with a dynamic algorithm that monitors the network outputs and signals when a steady state has been reached. However, such an algorithm could become rather complicated when trying to detect limit cycles. For a fixed value of n_c and an arbitrary network topology, the best way to determine n_c is empirically. Using reasonable guesses for the weights and time delays, the network can be simulated in the normal runtime mode and its transient response measured. Then, n_c can be set to the smallest number of periods that will provide this amount of transient time. In general, it is a good idea to increase this value of n_c by 1 or 2 for a safety margin.
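These two guidelines can be captured in a small helper (a rule-of-thumb sketch, not part of the SBP algorithm itself):

```python
def choose_parameters(T0, transient_time, margin=2):
    """K: at most T0/10, so the highest harmonic keeps >= 10 samples per
    period.  n_c: enough cycles that n_c - 1 periods cover the measured
    transient time, plus one measurement cycle and a safety margin."""
    K = max(1, T0 // 10)
    transient_cycles = -(-transient_time // T0)   # ceiling division
    n_c = transient_cycles + 1 + margin
    return K, n_c
```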
However, the training process still can fail due to insufficient transient time, because the exact path of the evolution in weight and time-delay space is not known. For example, when training recurrent linear networks, the poles could temporarily migrate close to the imaginary axis, resulting in longer transient times. If n_c is not large enough, the adaptation could fail or converge to a false "solution" that incorporates the long transient effects.

7 Sample Simulation Results
The simple recurrent network shown in Figure 2 was used to test the SBP algorithm on an infinite impulse response (IIR) network. Both the weights w12 and w21 and the time delays T12 and T21 were allowed to be adaptable. The initial weights were set to 0.5 (with no random perturbations) and the time delays to 0. The input interconnect was fixed with w10 = 1 and T10 = 0. A single input/output pair of sequences was used for training. The input sequence was the one-dimensional limit cycle defined by

x1[t] = sin(2πt/T0)   (7.1)
where T0 = 200 time steps. The desired output sequence was generated by the network itself by setting τi = 49.498 time steps (so ai = 0.98), w10 = w21 = 1, w12 = −1, T10 = 0, and T21 = T12 = 30 time steps. Both cell output functions were linear with unit gain. Although using the network to generate the desired output sequence is not indicative of a practical problem, it does ensure that the network has a chance of learning the input/output sequence relationship.
Figure 2: A simple linear filter/oscillator with an infinite impulse response.
Training Periodic Sequences
A total of five spectral components were computed, so K = 4. Strictly speaking, K could have been 1, since the input sequence contains only one frequency component. The number of cycles per sequence per epoch was set to n_c = 2, allowing for one transient cycle. To speed convergence, the SuperSAB adaptive gain algorithm (Tollenaere 1990) was employed, whereby the common learning gain η in equation 2.17 becomes a set of time-varying gains ηij[t]. In this algorithm, the gain for each weight or time delay is increased as long as the error gradient for that particular weight or time delay does not change sign. If it does, the gain is penalized and the gradient driving term is temporarily reset to 0. The resulting evolution of the training error E during adaptation is shown in Figure 3a. The corresponding evolutions of the weights and time delays are shown in Figure 3b and c. The initial and final limit cycles are illustrated in Figure 4. These plots show the successful adaptation performed by the SBP algorithm. The oscillations apparent in the evolutions of the total training error and the weights are induced by the gradient resets made by the SuperSAB algorithm when the gradient ΔTij[t] changes sign at about epoch 270. Since only one training sequence was used here, the time delays were allowed to wrap around 0 to T0. The delay T12 takes advantage of this ability, as illustrated in Figure 3c. If the wrap-around is disabled, the adaptation settles into an unsatisfactory local minimum, preventing the training from completing successfully. Other simulations showed that various combinations of the weights and time delays can be adapted. However, if the input interconnect (w10, T10) is allowed to vary along with both sets of weights and time delays, the adaptation path is such that the network develops an eigenvalue that is very close to 1 in magnitude. The resulting long time constant prevents the network from reaching a steady state within the allotted transient time. Coincidentally, the adaptation converges to a solution, but this solution is not the desired one because it cancels out the long transient response. For the correct solution to be obtained, n_c would have to be increased significantly (by two orders of magnitude in this particular case).

8 Discussion
The same network in Figure 2 was trained on two input/output limit cycles with different fundamental periods. As before, the network was used to generate the output cycles given the two input cycles. The SBP algorithm successfully trained the network to learn the new set of weights and time delays. In this case, the time delays could not wrap around T0, since T0 was not the same for both cycles. Several other cases have been tested with the SBP algorithm (Kottas 1991). It has been used to train a nonperiodic impulse response h[t]
Figure 3: The evolution of the weights and time delays for the network shown in Figure 2. (a) The total training error E. (b) The weights. (c) The time delays. The negative delays indicate that wrap-around has occurred; the actual delay is given by T12 + T0 when T12 < 0. T0 = 200 time steps here.
Figure 4: (a) The initial (before training) and desired (final) output limit cycles. (b) Snapshots of 8 output cycles taken during the training process, with 100 epochs between cycles.

into a conventional finite-impulse-response (FIR) filter network. Sharp transitions in h[t] can only be approximated because of the finite number of Fourier components used in the training. The SBP algorithm also can train the time delays in an FIR filter network. If all weights are held constant at 1, the resulting network realizes a phase-only FIR filter. If the time delays are permitted to wrap around T0 in this case, the SBP algorithm will not necessarily find the optimal solution, that is, one with a minimum of time delay. However, the time delays can be modified easily by subtracting off unnecessary delays such that the relative delays between the FIR taps are preserved. Finally, the SBP algorithm can train vector patterns, and not just sequences, into a network. For example, it successfully trained a feedforward network with one hidden layer to learn the exclusive-OR operation between two binary inputs. In this case, only the weights were adaptable and the time delays were set to 0. Furthermore, only the bias component (K = 0) was used for training. The number of cycles per sequence (a binary vector here) per epoch was set to n_c = 3 to allow the discrete-time simulation to completely propagate an input pattern through the network and produce the corresponding output pattern. In all these cases, the SBP algorithm requires that a steady-state solution exist for each input. If the network is inherently unstable, as with a recurrent linear network having eigenvalues greater than one in magnitude, the SBP algorithm will not work and thus will not be able to restore stability to the network.
Adaptable gains ηij[t] can significantly decrease the convergence time, especially when the weights and time delays settle near a quadratic well in the error surface. However, the gradient-reset step in the SuperSAB algorithm often induces oscillations in the adaptation. A simple solution is to ignore this part of the SuperSAB algorithm and let the SBP algorithm continue from the current point (as opposed to the previous point) on the error surface. Compared with the conventional recurrent backpropagation algorithm, the SBP algorithm requires more computational resources. However, the added computational complexity is small compared with the ease of generating networks for solving certain problems. Consider the task of recognizing limit cycles. Using the SBP algorithm, it is possible to train very quickly two linear networks, both with no hidden cells, to recognize arbitrarily shaped limit cycles in less than three periods with a high degree of phase sensitivity. This application is discussed more fully in the context of a finite state machine based on limit cycles in Kottas (1991).

9 Conclusions
The spectral backpropagation (SBP) algorithm can train the weights and time delays in a network given a set of periodic input/output sequences. Its adaptation criterion is based on minimizing the spectral energy in a Fourier series decomposition of the output error sequences. A useful property of the SBP algorithm is that it allows the cells in the network to have finite bandwidth. The SBP algorithm has been demonstrated successfully with continuous-time limit cycles and recurrent networks (varying combinations of the weights and time delays), nonperiodic discrete-time sequences and FIR networks (varying the weights and time delays independently), and static vector patterns in feedforward networks (weights only). Since it requires steady-state behavior to exist in the network, the SBP algorithm cannot train inherently unstable networks. The main advantage of having trainable time delays is that they allow the phase characteristics of the network to be controlled more easily than with the conventional recurrent backpropagation algorithm. This feature is desirable when designing networks to process periodic signals.

Acknowledgments

This work was sponsored in part by DARPA through Grant AFOSR-860301. I would like to thank Vernon Shrauger, Thomas McNamara, and the reviewer for their useful comments on the manuscript. I also would like to acknowledge several helpful discussions with Cardinal Warde and Michael Jordan.
References

Jordan, M. I. 1989. Supervised learning and systems with excess degrees of freedom. In Proceedings of the 1988 Connectionist Models Summer School, G. Hinton, D. Touretzky, and T. Sejnowski, eds., pp. 62-75. Morgan Kaufmann, San Mateo, CA.
Kottas, J. A. 1991. Limit cycles in neural networks for information processing. Ph.D. thesis, MIT.
Pearlmutter, B. A. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1, 263-269.
Pineda, F. 1989. Recurrent backpropagation and the dynamical approach to adaptive neural computation. Neural Comp. 1, 161-172.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing. Vol. 1: Foundations, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., pp. 318-362. The MIT Press, Cambridge, MA.
Tollenaere, T. 1990. SuperSAB: Fast adaptive back propagation with good scaling properties. Neural Networks 3, 561-573.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. 1989. Phoneme recognition using time-delay neural networks. IEEE Trans. ASSP 37, 328-339.
Yao, Y., and Freeman, W. J. 1990. Model of biological pattern recognition with spatially chaotic dynamics. Neural Networks 3, 153-170.

Received 3 September 1991; accepted 5 May 1992.
Communicated by Halbert White
Generalization and Approximation Capabilities of Multilayer Networks

Yoshikane Takahashi
NTT Network Information Systems Laboratories, 1-2356 Take, Yokosuka-shi, Kanagawa 283-03, Japan
This paper develops a theory for constructing 3-layered networks. The theory allows one to specify a finite discrete set of training data and a network structure (minimum intermediate units, synaptic weights and biases) that generalizes and approximates any given continuous mapping between sets of contours on a plane within any given permissible error.
1 Introduction

A real object world usually consists of a continuum, each point of which also consists of a continuum; examples include objects such as pictures and handwritten characters. Humans perceive the whole object world continuously, both inter-object and intra-object, from a certain finite discrete set of perceived data. These human perception capabilities (generalization and approximation) are represented by a continuous mapping between the object worlds. Multilayer networks are expected to simulate these generalization and approximation capabilities in an effective manner. Ji et al. (1990) modified the backpropagation learning algorithm for 3-layered networks to generalize continuous curves from sets of integer points. However, they do not further investigate specific construction methods for optimal training data and network structures (minimum intermediate units, synaptic weights, and biases) to generate neural mappings that generalize continuous curves; this is left as an open problem (the generalization problem). It has been proven that any continuous mapping from an m-dimensional real space to an n-dimensional real space can be approximated within any given permissible distortion by a 3-layered network with enough intermediate units (Hornik et al. 1989; Funahashi 1989). However, this result assures just the network's existence and provides no construction methods for optimal network structures to generate neural mappings

Neural Computation 5, 132-139 (1993) © 1993 Massachusetts Institute of Technology
that approximate the continuous mapping; this is also left as another open problem (the approximation problem). This paper gives a theoretical solution to both the generalization and the approximation problems. That is, it develops a theory to construct a set of training data and a 3-layered network structure that generates a neural mapping which generalizes and approximates any given continuous mapping between sets of contours on a plane within any given permissible distortion.

2 Problem Formulation
2.1 Definitions.

2.1.1 Metric Spaces of Contours. Let x denote a contour on the plane,

x = (x1(t), x2(t)), t ∈ I   (2.1)

where I = [0, 1]. Set C(I) = {x | x: contour}. C(I) is a metric space with the distance d:

d(x^a, x^b) = max over t ∈ I of |x^a(t) − x^b(t)| for ∀x^a, ∀x^b ∈ C(I)   (2.2)

where |x^a(t) − x^b(t)| = Σ_i |x_i^a(t) − x_i^b(t)| for ∀t ∈ I. Assume that X and Y denote any compact subspaces of C(I).

2.1.2 Continuous Mappings. Define a continuous mapping f from X
onto Y, f: X → Y, such that [f(x)](t) = y(t), where f is a restriction of a mapping F: C(I) → C(I) that satisfies the following double continuity conditions (2.3):

a. For each x in C(I), F(x) and F^{-1}(x) are continuous on I.
b. F and F^{-1} are continuous on C(I).

Examples of f include smooth deformations of contours such as rotation and similarity extension. Any one continuous mapping f is fixed and discussed.

2.1.3 Three-Layered Networks. Each constituent of the 3-layered network Ω is defined as follows.

1. Inputs and outputs are values of contours x in X and y in Y, respectively.

2. The input layer has two units. Outputs from unit i are represented by the set {x_i(t) | t ∈ I} (i = 1, 2).

3. The intermediate layer has r units, with 1 ≤ r ≤ r_M, where r_M is assumed to be a fixed natural number. The mapping function of unit j is defined as

z_j(t) = φ[Σ_i λ_ij x_i(t) − θ_j]   (2.4)

where φ is any bounded, monotone increasing, and continuously differentiable function. λ_ij is a synaptic weight from unit i to unit j with the condition |λ_ij| ≤ Λ (Λ: positive constant). θ_j is a bias at unit j with the condition 0 ≤ θ_j ≤ Θ (Θ: positive constant) (1 ≤ i ≤ 2, 1 ≤ j ≤ r). All these mapping functions together are expressed as z(t) = φ[λx(t) − θ].

4. The output layer has two units. The mapping function of unit k is defined as

y_k(t) = Σ_j c_jk z_j(t)   (2.5)

where c_jk is a synaptic weight from unit j to unit k with the condition |c_jk| ≤ C (C: positive constant) (1 ≤ j ≤ r, 1 ≤ k ≤ 2).

5. Assume that r_M and φ are known and fixed, while r, λ, θ, and c are unknown variables. Γ = (r, λ, θ, c) is called a structure of Ω.

2.2 Generalization and Approximation. Let any structure Γ be given. Then Γ generates a neural mapping ξ defined as ξ: C(I) → C(I) such that

y(t) = [ξ(x)](t) = c φ[λ x(t) − θ] for ∀x ∈ C(I), y ∈ C(I)   (2.6)

2.2.1 Generalization. Assume any nonnegative real number ε ≥ 0 is given. Define X_N × Y_N as a finite discrete subspace of X × Y:

X_N × Y_N = {(x^n, y^n) | x^n ∈ X, y^n ∈ Y, y^n = f(x^n), n = 1, 2, ..., N}   (2.7)

Also define Δ as a partitioning of I:

Δ = {t^q | 0 = t^0 < t^1 < ... < t^q < ... < t^{K−1} < t^K = 1; t^{q+1} − t^q = 1/K for q = 0, 1, ..., K − 1}   (2.8)
Yoshikane Takahashi
3 Solving GAP
GAP is solved in three steps, each described in one of the following sections.

3.1 Training Data Construction. As it is assumed that X and Y are compact in C(I) and f is continuous, the subspace X_N × Y_N of X × Y in 2.7 is constructed so that

∀x ∈ X, ∃n: d(x, x^n) < ε/(40 r_M C Λ M) and d[f(x), f(x^n)] < ε/5   (3.1)
As f is continuous, a positive real number δ > 0 is found such that

d[F(x^a), F(x^b)] < ε/10 if d(x^a, x^b) < δ, where ∀x^a ∈ X_N, ∀x^b ∈ C(I)   (3.2)
Applying the Weierstrass polynomial approximation theorem to each continuous function x^n ∈ X, a polynomial x^p ∈ C(I) is constructed, with M reselected if necessary, so that

d(x^n, x^p) < ε/(40 r_M C Λ M) < δ   (3.3)

Set X_P = {x^p | x^p: a polynomial that satisfies 3.3, x^n ∈ X_N} ⊂ C(I). Thus, 3.2 and 3.3 lead to

d[F(x^n), F(x^p)] < ε/10   (3.4)
Applying again the Weierstrass polynomial approximation theorem to each continuous function F(x^p) ∈ Y, a polynomial F^p(x^p) ∈ C(I) is constructed such that

d[F(x^p), F^p(x^p)] < ε/10   (3.5)

Thus, 3.4 and 3.5 lead to

d[f(x^n), F^p(x^p)] < ε/5   (3.6)
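The Weierstrass step can be illustrated numerically: a least-squares polynomial fit drives the sup-distance to a sampled contour coordinate below any tolerance as the degree grows. The sinusoidal test function and the degrees used below are illustrative choices, not taken from the paper:

```python
import numpy as np

def sup_error(coeffs, t, y):
    """Sup-norm error of the polynomial with coefficients `coeffs` against
    samples y at parameters t (a discrete stand-in for the metric d)."""
    return float(np.max(np.abs(np.polyval(coeffs, t) - y)))

t = np.linspace(0.0, 1.0, 101)          # I = [0, 1], sampled
y = np.sin(2.0 * np.pi * t)             # one coordinate of a smooth contour
err = {deg: sup_error(np.polyfit(t, y, deg), t, y) for deg in (3, 10)}
```

Raising the degree shrinks the sup-norm error, mirroring how the polynomials x^p and F^p(x^p) can be made to satisfy the ε-bounds of 3.3 and 3.5.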
As X_N × Y_N is a finite subspace of X × Y, by selecting an appropriately fine partitioning of I in 2.8, a set of training data expressed in 2.9 is constructed for any x^n ∈ X_N so that

∀t ∈ I, ∃x^p ∈ X_P, ∃F^p(x^p) ∈ C(I), ∃q: |x^p(t) − x^p(t^q)| < ε/(40 r_M C Λ M) for ∀t ∈ [t^q, t^{q+1}]   (3.7)

|[F^p(x^p)](t) − [F^p(x^p)](t^q)| < ε/5 for ∀t ∈ [t^q, t^{q+1}]   (3.8)
Capabilities of Multilayer Networks
3.2 Structure Construction. This section constructs a structure Γ_0 with the minimum number of units r_0 that realizes Π within the distortion ε/10NK. It is a necessary condition for a structure realizing Π within the distortion ε/10NK to have the minimum units r_0 that the structure generate a mapping from the input layer to the output layer that extracts the minimum information from Π with the distortion ε/10NK (Bichsel and Seitz 1989). Let Π' denote the data set that extracts the minimum information from Π with the distortion ε/10NK. In general, the rate distortion theorem in information theory establishes the information extraction method that preserves the minimum information from any information space (typically a source information space) to any other information space (typically a compressed information space) within any given distortion between the two information spaces (e.g., Berger 1971). Thus, Π' is constructed with the aid of the rate distortion theorem, and is represented as follows:

Π' = {(x^{nq}, w^{nq}) | (x^n, w^n) ∈ X_N × Y_N, x^{nq} = x^n(t^q), w^{nq} = w^n(t^q), t^q ∈ Δ; n = 1, 2, ..., N; q = 0, 1, ..., K}   (3.9)

Here Π' satisfies the condition that

|w^n(t^q) − y^n(t^q)| ≤ ε/10 for ∀y^n ∈ Y_N, ∀t^q ∈ Δ   (3.10)
Next, let Γ_0 exactly realize Π'. Then the components λ, θ, and c of Γ_0 must satisfy the following equation system:

z_j^n(t^q) = φ[Σ_i λ_ij x_i^n(t^q) − θ_j] (j = 1, 2, ..., r_M)   (3.11)

w_k^n(t^q) = Σ_j c_jk z_j^n(t^q) (k = 1, 2)   (3.12)

The structure Γ_0 is constructed by solving the equation system 3.11, 3.12 with a sequence of three theoretical procedures, briefly described as follows.
Procedure 1. The equation system 3.12 is solved. Set Λ_1(C) = {c̃ | c̃: a real number such that |c̃| ≤ C}. Then, for any fixed c ∈ Λ_1(C)^{2r_M}, the system 3.12 becomes a linear equation system consisting of 2N(K + 1) equations with the 2N(K + 1) known constants w_k^n(t^q) and the r_M N(K + 1) unknown variables z_j^n(t^q). This linear equation system is solvable in z_j^n(t^q) under appropriate parametric constants c ∈ Λ_1s(C) ⊆ Λ_1(C)^{2r_M}. Solving it, with arbitrary constants assigned to some z_j^n(t^q)'s if necessary, produces the set Z(c) = {z(c) | z(c): a solution of 3.12 under c ∈ Λ_1s(C)}.

Procedure 2. The equation system 3.11 is solved (a similar solution procedure is proposed by Shepansky 1988). Set Λ_2(Θ) = {θ̃ | θ̃: a real number such that 0 ≤ θ̃ ≤ Θ}. Then, using the inverse function φ^{-1} of φ, for any fixed θ ∈ Λ_2(Θ)^{r_M}, the system 3.11 becomes the following linear equation system, consisting of r_M N(K + 1) equations with the known constants x_i^n(t^q) (2N(K + 1) of them), φ^{-1}[z_j^n(t^q)] (r_M N(K + 1) of them), and θ_j (r_M of them), and the 2r_M unknown variables λ_ij:

Σ_i λ_ij x_i^n(t^q) = φ^{-1}[z_j^n(t^q)] + θ_j (j = 1, 2, ..., r_M)   (3.13)
The system 3.13 is solvable in λ_ij under appropriate parametric constants c ∈ Λ_1s'(C) ⊆ Λ_1s(C), z(c) ∈ Z(c)' ⊆ Z(c), and θ ∈ Λ_2s(Θ) ⊆ Λ_2(Θ)^{r_M}. Solving it produces the set Λ_3s(Λ) = {λ(c, θ) | λ(c, θ): a solution of 3.13 under c ∈ Λ_1s'(C), z(c) ∈ Z(c)', and θ ∈ Λ_2s(Θ)} ∩ Λ_3(Λ), where Λ_3(Λ) = {λ̃ | λ̃: a real number such that |λ̃| ≤ Λ}.

Procedure 3. The intermediate unit number r_M is minimized. Pick out all combinations of λ(c, θ) ∈ Λ_3s(Λ), θ ∈ Λ_2s(Θ), and c ∈ Λ_1s'(C) that at the same time include the subvectors (λ_1j, λ_2j) = (0, 0), θ_j = 0, and (c_j1, c_j2) = (0, 0), respectively, for some j. Eliminating such a null intermediate unit j (or units) from each combination [λ(c, θ), θ, c] produces a structure Γ* = (r, λ(c, θ)*, θ*, c*) that includes no null intermediate unit any longer. Then, select any one structure Γ_0 = (r_0, λ(c, θ)*, θ*, c*) that has the minimum number r_0 of intermediate units among all these structures Γ*.

3.3 Conditions Satisfaction. It is assured that the pair (Π, Γ_0) constructed in Sections 3.1 and 3.2 in fact satisfies the conditions (CON1) and (CON2) in GAP. It is evident from the construction of Γ_0 in Section 3.2 that (CON2) is satisfied. Consider next (CON1). Γ_0 generates the neural mapping ξ in 2.6 that satisfies the conditions:
3.3.1 Generalization Condition (2.10). As ξ maps x^n(t^q) to w^n(t^q), where [x^n(t^q), w^n(t^q)] satisfies 3.14, it holds that

[ξ(x^n)](t^q) = w^n(t^q) for n = 1, 2, ..., N and q = 0, 1, ..., K   (3.15)

Using 3.15, the left-hand side of 2.10 is evaluated as

|[ξ(x)](t) − y^n(t^q)| ≤ |[ξ(x)](t) − [ξ(x^n)](t^q)| + |w^n(t^q) − y^n(t^q)|   (3.16)

Owing to 2.14, 3.1, 3.3, and 3.7, the first term of the right-hand side of 3.16 is evaluated as

|[ξ(x)](t) − [ξ(x^n)](t^q)| ≤ r_M C Λ M |x(t) − x^n(t^q)| ≤ r_M C Λ M [d(x, x^n) + d(x^n, x^p) + |x^p(t) − x^p(t^q)| + d(x^p, x^n)] < (1/40)[ε + ε + ε + ε] = ε/10   (3.17)
Thus, 2.10 is obtained from 3.10, 3.16, and 3.17.

3.3.2 Approximation Condition (2.11). The left-hand side of 2.11 is evaluated as

|[f(x)](t) − [ξ(x)](t)| ≤ d[f(x), f(x^n)] + d[f(x^n), F^p(x^p)] + |[F^p(x^p)](t) − [F^p(x^p)](t^q)| + d[F^p(x^p), f(x^n)] + |y^n(t^q) − [ξ(x)](t)|   (3.18)

Thus, 2.11 is obtained by applying the evaluations 3.1, 3.6, 3.8, and 2.10 to the right-hand side of 3.18.

4 Conclusion
This paper develops the theory to solve GAP, the generalization and approximation problem of 3-layered networks, in the application of contour perception on the plane. Compared to works such as Ji et al. (1990) and Hornik et al. (1989), the theory in this paper is much more elaborate and constructive, and thus applicable to real-world object perception. Further, the theory provides more specific and direct insights into the training data and structure for GAP, which elucidates the generalization and approximation capabilities of 3-layered networks. The theory needs to be extended to cover more complex cases, such as general multilayer networks applied to 3-dimensional object perception. As for the practical aspect of GAP, the theory in this paper is expected to stimulate the development of effective algorithmic solutions to GAP, together with estimates of their computational complexity.

References

Berger, T. 1971. Rate Distortion Theory. Prentice-Hall, Englewood Cliffs, NJ.
Bichsel, M., and Seitz, P. 1989. Minimum class entropy: A maximum information approach to layered networks. Neural Networks 2, 133-141.
Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Ji, C., Snapp, R., and Psaltis, D. 1990. Generalizing smoothness constraints from discrete samples. Neural Comp. 2, 188-197.
Shepansky, J. 1988. Fast learning in artificial neural systems: Multilayer perceptron training using optimal estimation. In J. R. Johnson, ed., Proceedings of the Aerospace Applications of Artificial Intelligence Conference, Netrologic, Vol. 1, pp. 85-93. Dayton, OH.

Received 25 February 1992; accepted 5 May 1992.
Communicated by Gerald Tesauro
Statistical Theory of Learning Curves under Entropic Loss Criterion
Shun-ichi Amari, Noboru Murata
Department of Mathematical Engineering and Information Physics, University of Tokyo, Bunkyo-ku, Tokyo 113, Japan
The present paper elucidates a universal property of learning curves, showing how the generalization error, the training error, and the complexity of the underlying stochastic machine are related, and how the behavior of a stochastic machine improves as the number of training examples increases. The error is measured by the entropic loss. It is proved that the generalization error converges to H_0, the entropy of the conditional distribution of the true machine, as H_0 + m*/(2t), while the training error converges as H_0 − m*/(2t), where t is the number of examples and m* measures the complexity of the network. When the model is faithful, meaning that the true machine belongs to the model, m* reduces to m, the number of modifiable parameters. This is a universal law, since it holds for any regular machine under the maximum likelihood estimator, irrespective of its structure. Similar relations are obtained for the Bayes and Gibbs learning algorithms. These learning curves show the relation among the accuracy of learning, the complexity of a model, and the number of training examples.

1 Introduction
It is an important subject of research in neural networks and machine learning to study general characteristics of learning curves, which represent how fast the behavior of a learning machine improves as it learns from examples. It is also important to evaluate the performance of a trained machine in terms of its performance on the old training examples. This is given by the relation between the generalization error and the training error, expressed through the complexity of the network. This is an interdisciplinary problem related to neural networks, machine learning, algorithms, statistical inference, etc. There are a number of approaches to learning machines. One is the stochastic descent learning algorithm (see, e.g., Widrow 1966; Amari 1967; Rumelhart, Hinton, and Williams 1986; White 1989). Even in an old paper by Amari (1967), where the stochastic descent method was proposed for

Neural Computation 5, 140-153 (1993) © 1993 Massachusetts Institute of Technology
general layered neural networks, the asymptotic dynamic behavior of learning curves was discussed, and the trade-off between the learning speed and the accuracy was studied [see Heskes and Kappen (1991) for recent developments]. Another approach is a computational one (Valiant 1984), in which the learning performance is evaluated stochastically under computational-complexity constraints on algorithms. This approach was successfully applied to neural networks (Baum and Haussler 1989). Haussler et al. (1988) studied the convergence rate of general learning curves by relaxing algorithmic constraints. See also Haussler et al. (1991) for recent developments. Here, the VC dimension plays a major role. Yamanishi (1990, 1991) extended the framework to noisy or stochastic machines. The third approach is statistical-mechanical. Levin et al. (1990) presented a Bayesian statistical-physical approach to studying learning curves, in which the behaviors of generalization errors, predictive-entropic errors, and the stochastic complexity of Rissanen (1986) were discussed. There are also a number of papers using a statistical-mechanical approach to this problem [see, for example, Hansel and Sompolinsky (1990); Gyorgyi and Tishby (1990); Seung et al. (1991); Opper and Haussler (1991)]. The statistical-mechanical approach can give deep results for specific simple models such as the simple perceptron, in which the replica method is typically used in the "thermodynamic limit" situation. The present paper uses a fourth approach, that of statistical inference, to elucidate the asymptotic learning behavior of a general stochastic learning dichotomy machine. The predictive entropic loss is used for evaluating the machine performance, and the maximum likelihood estimator, or the Bayes and Gibbs algorithms, is used to choose a candidate machine based on the training examples.
The statistical approach is based on the asymptotic expansion of estimators [see, e.g., Amari (1985) for the higher-order asymptotic expansion]. Before comparing the results of the present paper with others, we state the problems treated here and the main results. We consider a stochastic machine or stochastic multilayer neural network parameterized by a vector parameter w which, when an input x is given, emits a binary output y with probability p(y | x, w). Suppose we are given t examples ξ_t = {(y_1, x_1), …, (y_t, x_t)}, where x_i is randomly generated from a fixed but unknown probability distribution p(x) and y_i is a corresponding output generated by the true machine, which has parameter w_0. The maximum likelihood estimator ŵ_t is first calculated as a candidate machine. This machine predicts an output y for a given x by the predictive distribution p(y | x, ŵ_t). There are two different methods of evaluating the behavior of a machine. One is the average error rate at which the candidate machine predicts an output different from that of the true machine. The other is the average predictive entropy, evaluated by the expectation of −log p(y | x, ŵ_t) for an input-output pair (x, y), which is zero if the prediction is 100% correct. We use this entropic
loss to evaluate the learning behavior of a machine (see also Yamanishi 1991).
The generalization error is the average entropic loss, or average predictive entropy, of a trained machine for a new example (y_{t+1}, x_{t+1}). It is proved that the average predictive entropy for the generalization error ⟨e(t)⟩_gen converges to the entropy H_0 of the true machine asymptotically as in the following theorems, where ⟨ ⟩ denotes the expectation and m is the number of parameters in w. This is in agreement with Yamanishi's result (Yamanishi 1991). On the other hand, the training error ⟨e(t)⟩_train is the average entropic loss of the candidate machine for the training examples (y_i, x_i), i = 1, …, t, which are used to estimate ŵ_t. It is proved that the training error also converges as in the theorems.

Theorem (Universal Convergence Theorem).

⟨e(t)⟩_gen = H_0 + m/(2t),   ⟨e(t)⟩_train = H_0 − m/(2t)
Since H_0 is unknown, we can obtain ⟨e(t)⟩_gen by the relation

⟨e(t)⟩_gen = ⟨e(t)⟩_train + m/t
This is in good agreement with the AIC approach (Akaike 1974). Instead of using the maximum likelihood estimator ŵ_t, we can use the Bayes approach. When the behavior of a trained machine is evaluated by the Bayes posterior distribution (the Bayes algorithm), the learning curves are exactly the same as in the previous theorem. When we choose a candidate machine from the posterior distribution [the Gibbs learning algorithm (Opper and Haussler 1991)], we obtain the following result.

Theorem (Bayesian Convergence Theorem).

⟨e(t)⟩_gen = H_0 + m/t

for the generalization error, and

⟨e(t)⟩_train = H_0

for the training error.

The above results hold under the assumption that there exists w_0 by which the true machine is specified. However, in many cases there is no w_0 that specifies the true machine. The model is said to be unfaithful in this case. Let w_0* be the best approximation to the true machine in
the sense of the Kullback divergence and let H_0* be its entropy. By using the maximum likelihood estimator, we prove the following theorem, where m* = tr(K*⁻¹G*), to be defined later, plays the role of an effective number of dimensions.

Theorem (Convergence Theorem for Unfaithful Models).

⟨e(t)⟩_gen = H_0* + m*/(2t),   ⟨e(t)⟩_train = H_0* − m*/(2t)

Now we compare our methods and results with others. The 1/t convergence law was first proved by Haussler et al. (1988). However, its coefficients were not exactly known. Their exact values are still unknown even for the simple perceptron in the case of the error-rate loss (Haussler et al. 1991). By using the entropic loss, the theorem gives the universal coefficient of the convergence rate. This is universal in the sense that the theorem holds irrespective of the machine architecture. This implies that the VC dimension seems to be irrelevant for stochastic machines. The statistical-mechanical approach is useful for determining the coefficient of the 1/t convergence. However, it uses the replica method, which is unjustified. Moreover, it is applicable only to simple models like the simple perceptron and only in the thermodynamic limit, implying that both t and m tend to infinity with a fixed ratio α = t/m. Our method does not use statistical-mechanical assumptions such as the replica method, the annealed approximation, and the thermodynamic limit. Instead, we use the standard technique of asymptotic statistical inference, which is valid under regularity conditions such as the existence of the moments of random variables and of the Fisher information. The statistical technique is not applicable to deterministic machines, because they violate the regularity conditions. Therefore, the present paper complements the result by Amari et al. (1992), where learning curves are obtained for deterministic machines under the annealed approximation. Amari (1992) succeeded in obtaining a similar result for deterministic machines without the annealed approximation. The present results are closely related to model selection by AIC (Akaike 1974) and its generalization to general nonlinear neural networks (Murata et al. 1991; Moody 1992).
The first theorem can be regarded as a detailed version of the original AIC, while the third theorem corresponds to its generalization. Moody (1992) proposes a similar generalization of AIC under a more general loss criterion in an unfaithful model. This approach is more general in the sense that it includes a regularization term, but less general than Murata et al. (1991) in the sense that the latter treats a more general nonlinear model including non-additive noise. It should be pointed out that these papers give essentially the same effective number m* of parameters, although their expressions differ.
2 Statistical Theory of Stochastic Machines
Let us consider a machine that receives an n-dimensional input signal x ∈ R^n and emits a binary output y = 1 or −1. A machine is stochastic when y is not a function of x but takes the values 1 and −1 subject to a probability p(y | x) specified by x. Let us consider a parametric family of machines in which a machine is specified by an m-dimensional parameter w ∈ R^m such that the probability of output y, given an input x, is specified by p(y | x, w). A typical form of p(y | x, w) is as follows. A machine first calculates a smooth function f(x, w) and then specifies the probabilities by

Prob{y = 1 | x, w} = k(f(x, w)),   Prob{y = −1 | x, w} = 1 − k(f(x, w))   (2.1)

where

k(f) = 1 / (1 + e^{−βf})   (2.2)

When f(x, w) > 0, it is more likely that the output of the machine is y = 1, and when f(x, w) < 0, it is more likely that the output is y = −1. The parameter 1/β is the so-called "temperature" parameter. When β = ∞, the machine is deterministic, emitting y = 1 when f(x, w) > 0 and y = −1 when f(x, w) < 0. Let us consider the case where the true machine that generates examples is specified by w_0. More specifically, let p(x) be a nonsingular probability distribution of input signals x, and let x_1, …, x_t be t randomly and independently chosen input signals subject to p(x). The true machine generates answers y_1, …, y_t using the probability distribution p(y_i | x_i, w_0), i = 1, …, t. Let ξ_t be the t pairs of examples thus generated,
ξ_t = {(x_1, y_1), …, (x_t, y_t)}   (2.3)
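As a concrete illustration, generating such an example set can be sketched as below. The choice f(x, w) = w · x, the Gaussian input distribution p(x), and all the constants are illustrative assumptions, not prescribed by the paper.

```python
# Sketch of the stochastic sigmoid machine: y = +1 with probability
# k(f) = 1/(1 + exp(-beta * f(x, w))), here with the assumed f(x, w) = w . x.
import numpy as np

rng = np.random.default_rng(0)

def sample_examples(w0, beta, t):
    """Draw t examples (x_i, y_i) from the true machine w0."""
    n = len(w0)
    xs = rng.standard_normal((t, n))              # inputs x_i ~ p(x), assumed Gaussian
    p1 = 1.0 / (1.0 + np.exp(-beta * xs @ w0))    # Prob{y = 1 | x, w0}
    ys = np.where(rng.random(t) < p1, 1, -1)
    return xs, ys

w0 = np.array([1.0, -0.5])
xs, ys = sample_examples(w0, beta=2.0, t=1000)
```

At large β (low temperature) the machine becomes nearly deterministic, emitting the sign of f(x, w).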
from which we guess the true machine. Let ŵ_t be the maximum likelihood estimator from the observed data ξ_t. Since the probability of obtaining ξ_t from a machine specified by w is

P(ξ_t | w) = ∏_{i=1}^t p(x_i) p(y_i | x_i, w)

by taking the logarithm, the sum

∑_{i=1}^t l(y_i | x_i, w)

should be maximized by the maximum likelihood estimator ŵ_t, where

l(y | x, w) = log p(y | x, w)   (2.4)
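A minimal numerical sketch of this maximum likelihood estimation for the assumed sigmoid machine with f(x, w) = w · x: the summed log likelihood of (2.4) is maximized by plain gradient ascent, an optimization choice of ours, since the paper does not prescribe one.

```python
# Fit w by maximizing (1/t) sum_i log p(y_i | x_i, w) via gradient ascent.
# For y in {-1, +1}, log p(y | x, w) = log sigmoid(beta * y * w.x).
import numpy as np

rng = np.random.default_rng(1)
beta, t = 2.0, 4000
w0 = np.array([1.0, -0.5])                       # true machine (assumed)
xs = rng.standard_normal((t, 2))
ys = np.where(rng.random(t) < 1 / (1 + np.exp(-beta * xs @ w0)), 1, -1)

def grad_loglik(w):
    # d/dw log sigmoid(beta*y*w.x) = beta * y * sigmoid(-beta*y*w.x) * x
    margin = beta * ys * (xs @ w)
    return (beta * ys / (1 + np.exp(margin))) @ xs / t

w = np.zeros(2)
for _ in range(2000):
    w += 0.5 * grad_loglik(w)                    # ascend the average log likelihood
# w should now approximate the true parameter w0
```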
3 Generalization Error and Training Error in Terms of the Predictive Distribution

Given t examples ξ_t, we estimate the true parameter by ŵ_t. The behavior of the estimated machine is given by the conditional probability p(y | x, ŵ_t). Given the next example x_{t+1}, randomly chosen subject to p(x), the next output y_{t+1} is predicted with the probability p(y_{t+1} | x_{t+1}, ŵ_t). The best prediction in the sense of the minimum expected error is that the predicted output ŷ_{t+1} is 1 when

p(1 | x_{t+1}, ŵ_t) > p(−1 | x_{t+1}, ŵ_t)

and is −1 otherwise. The prediction error is given by u_t = 0.5 |y_{t+1} − ŷ_{t+1}|. This is a random variable depending on the t training examples ξ_t and on x_{t+1}.
Its expectation ⟨u_t⟩_gen with respect to ξ_t and x_{t+1} is called the generalization error, because it is the average error when the machine trained with t examples predicts the output of a new example. On the other hand, the training error is evaluated by the average of the u_i (i = 1, …, t), which are the errors when the machine ŵ_t retrospectively predicts the past outputs y_i for the training inputs x_i, using the distribution p(y_i | x_i, ŵ_t), that is,

⟨u⟩_train = (1/t) ∑_{i=1}^t ⟨u_i⟩
This error never converges to 0 when a machine is stochastic, because even when ŵ_t converges to the true parameter w_0 the machine cannot be free from stochastic errors. The prediction error can also be measured by the logarithm of the predictive probability for the new input-output pair (y_{t+1}, x_{t+1}),
e(t) = −log p(y_{t+1} | x_{t+1}, ŵ_t)   (3.1)
This is called the entropic loss, log loss, or stochastic complexity (Rissanen 1986; Yamanishi 1991). The generalization entropic error is its expectation over the randomly generated training examples ξ_t and the new input-output pair (x_{t+1}, y_{t+1}),

⟨e(t)⟩_gen = −⟨log p(y_{t+1} | x_{t+1}, ŵ_t)⟩   (3.2)
Since the expectation of −log p(y | x) is the conditional entropy, the generalization entropic loss is the expectation of the conditional entropy H(Y | X; ŵ_t) over the estimator ŵ_t. The entropic error of the true
machine, specified by w_0, is given by the conditional entropy,

H_0 = H(Y | X; w_0) = E[−log p(y | x, w_0)]   (3.3)
Similarly, the training entropic error is the average of the entropic loss over the past examples (y_i, x_i) that were used to obtain ŵ_t,

⟨e(t)⟩_train = −(1/t) ∑_{i=1}^t ⟨log p(y_i | x_i, ŵ_t)⟩   (3.4)

Obviously, the training error is smaller than the generalization error. It is interesting to know the difference between the two errors. The following theorem gives the universal behaviors of the training and generalization entropic errors in a faithful model, that is, when there is a w_0 specifying the true machine.

Theorem 1 (Universal Convergence Theorem for Training and Generalization Errors). The asymptotic learning curve for the entropic training error is given by

⟨e(t)⟩_train = H_0 − m/(2t)   (3.5)

and for the entropic generalization error by

⟨e(t)⟩_gen = H_0 + m/(2t)   (3.6)

where m is the number of parameters in w.

The result of 1/t convergence is in good agreement with the results obtained for another model by the statistical-mechanical approach (e.g., Seung et al. 1991). It is possible to compare our result with Yamanishi (1991), where the cumulative log loss

∑_{i=1}^t ⟨e(i)⟩_gen

is used. Here ŵ_i is the maximum likelihood estimator based on the i observations ξ_i. From (3.6), we easily have

∑_{i=1}^t ⟨e(i)⟩_gen = t H_0 + (m/2) log t + o(log t)

in agreement with Yamanishi (1991), because of

∑_{i=1}^t (1/i) = log t + o(log t)
The proof of Theorem 1 uses the standard techniques of asymptotic statistics and is given in the Appendix.
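Theorem 1 also lends itself to a quick Monte Carlo check. The sketch below (an assumed logistic instance with m = 2 parameters and t = 50 examples, not an example from the paper) estimates the average gap between the entropic generalization and training errors, which by (3.5) and (3.6) should be close to m/t.

```python
# Monte Carlo check: <e>_gen - <e>_train should be roughly m/t for the MLE.
import numpy as np

rng = np.random.default_rng(2)
beta, m, t, trials = 1.0, 2, 50, 300
w0 = np.array([1.0, -0.5])                       # true machine (assumed)

def sample(n):
    xs = rng.standard_normal((n, m))
    ys = np.where(rng.random(n) < 1 / (1 + np.exp(-beta * xs @ w0)), 1, -1)
    return xs, ys

def entropic_loss(w, xs, ys):
    # -(1/n) sum_i log p(y_i | x_i, w) = (1/n) sum_i log(1 + exp(-beta*y_i*w.x_i))
    return float(np.mean(np.log1p(np.exp(-beta * ys * (xs @ w)))))

def mle(xs, ys, steps=300, lr=0.5):
    w = np.zeros(m)
    for _ in range(steps):
        w += lr * (beta * ys / (1 + np.exp(beta * ys * (xs @ w)))) @ xs / len(ys)
    return w

x_test, y_test = sample(20000)                   # large fresh set estimates <e>_gen
gaps = []
for _ in range(trials):
    xs, ys = sample(t)
    w_hat = mle(xs, ys)
    gaps.append(entropic_loss(w_hat, x_test, y_test) - entropic_loss(w_hat, xs, ys))

gap = float(np.mean(gaps))                       # Theorem 1 predicts about m/t
```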
4 Learning Curves for Unfaithful Model
It has so far been assumed that there exists w_0 such that the true distribution p(y | x) is written as

p(y | x) = p(y | x, w_0)   (4.1)
This implies that the model M = {p(y I x,w)} of the distribution parameterized by w is faithful. When the true distribution is not in M, that is, there exisits no wosatisfying (4.11, the model M is said to be unfaithful. We can obtain learning curves in the case of unfaithful models, in a quite similar manner as in the faithful case. Let p(y I x,w:) be the best approximation of the true distribution p(y I x) in the sense that w; minimizes the Kullback-Leibler divergence
where the expectation E is taken with respect to the true distribution p(x)p(y | x). We define the following quantities:

H_0* = E[−log p(y | x, w_0*)]   (4.2)
G* = E[{∇l(y | x, w_0*)}{∇l(y | x, w_0*)}^T]   (4.3)
K* = −E[∇∇l(y | x, w_0*)]   (4.4)

where ∇ is the gradient operator, ∇l denoting the column vector

∇l = (∂l/∂w_i)

the suffix T denotes the transposition of a vector, and ∇∇l is the Hessian matrix. In the faithful case, w_0* = w_0, H_0* = H_0, and G* = K* = G is the Fisher information matrix. However, in general, G* ≠ K* in the unfaithful case.
Theorem 2 (Convergence Theorem for Learning Curves: Unfaithful Case). The asymptotic learning curve for the entropic training error is given by

⟨e(t)⟩_train = H_0* − m*/(2t)   (4.5)

and for the entropic generalization error by

⟨e(t)⟩_gen = H_0* + m*/(2t)   (4.6)

where m* = tr(K*⁻¹G*) is the trace of K*⁻¹G*. See the Appendix for the proof. It is easy to see that m* = m in the faithful case, because K* = G*. The above relations can be used for selecting an adequate model (see Murata et al. 1991; Moody 1992).
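The effective dimension m* = tr(K*⁻¹G*) can also be estimated directly from data by replacing the expectations in (4.3) and (4.4) with sample averages. The sketch below does this for an assumed sigmoid machine with f(x, w) = w · x, evaluated at the true parameter; the model is then faithful, so K* = G* and the estimate should come out near m = 2.

```python
# Empirical effective dimension m* = tr(K^{-1} G) for a logistic machine.
import numpy as np

rng = np.random.default_rng(3)
beta, n = 1.0, 100000
w0 = np.array([1.0, -0.5])                        # true machine (assumed)
xs = rng.standard_normal((n, 2))
ys = np.where(rng.random(n) < 1 / (1 + np.exp(-beta * xs @ w0)), 1, -1)

margin = beta * ys * (xs @ w0)
# grad of l(y|x,w) at w0: beta * y * sigmoid(-margin) * x, one row per example
score = (beta * ys / (1 + np.exp(margin)))[:, None] * xs
G = score.T @ score / n                           # estimate of E[(grad l)(grad l)^T]
s = 1 / (1 + np.exp(-margin))
K = (beta**2 * s * (1 - s) * xs.T) @ xs / n       # estimate of -E[Hessian of l]
m_star = float(np.trace(np.linalg.solve(K, G)))   # faithful case: close to m = 2
```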
5 Bayesian Approach
The Bayesian approach uses a prior distribution q(w) and then calculates the posterior probability distribution Q(w | ξ_t) based on t observations (training examples). The predictive distribution based on ξ_t is defined by

p(y | x; ξ_t) = ∫ p(y | x, w) Q(w | ξ_t) dw   (5.1)

One idea is to use this predictive distribution for predicting the output. Another idea is to choose one candidate parameter ŵ_t from the posterior distribution Q(w | ξ_t) and to use p(y | x, ŵ_t) for predicting the output. The former is called the Bayes algorithm and the latter the Gibbs algorithm (Opper and Haussler 1991). The entropic generalization loss is evaluated by the expectation of −log p(y | x; ξ_t) for a new example (y, x) in the Bayes-algorithm case and by the expectation of −log p(y | x, ŵ_t) in the Gibbs-algorithm case. The entropic training loss is given correspondingly.
We first study the case of using the predictive distribution p(y | x; ξ_t). By putting

Z_t = ∫ q(w) ∏_{i=1}^t p(y_i | x_i, w) dw   (5.2)

the predictive distribution is written as

p(y_{t+1} | x_{t+1}; ξ_t) = Z_{t+1}/Z_t   (5.3)

[Amari et al. (1992); see also the statistical-mechanical approach, for example, Levin et al. (1990); Seung et al. (1991); Opper and Haussler (1991)]. Therefore,

⟨e(t)⟩_gen = ⟨log Z_t⟩ − ⟨log Z_{t+1}⟩   (5.4)
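Equation (5.3) suggests a crude numerical scheme: estimate Z_t and Z_{t+1} by averaging the likelihood over samples drawn from the prior q(w). The sketch below does this for an assumed sigmoid machine with a standard Gaussian prior; this naive Monte Carlo is only an illustration of (5.1)-(5.3), not the paper's analytic treatment, and it degrades quickly as t grows.

```python
# Bayes predictive probability as a ratio Z_{t+1}/Z_t, both estimated by
# averaging the likelihood over prior samples (assumed Gaussian prior q(w)).
import numpy as np

rng = np.random.default_rng(4)
beta, t = 1.0, 30
w0 = np.array([1.0, -0.5])                        # true machine (assumed)
xs = rng.standard_normal((t, 2))
ys = np.where(rng.random(t) < 1 / (1 + np.exp(-beta * xs @ w0)), 1, -1)

ws = rng.standard_normal((20000, 2))              # samples w ~ q(w)

# log prod_i p(y_i | x_i, w) for every prior sample w, shape (20000,)
ll = -np.log1p(np.exp(-beta * ys[:, None] * (xs @ ws.T))).sum(axis=0)
Zt = np.mean(np.exp(ll))                          # Monte Carlo estimate of Z_t

x_new = np.array([0.5, 0.5])
# multiply each likelihood by p(y_{t+1}=+1 | x_new, w) = sigmoid(beta*x_new.w)
p_plus = np.mean(np.exp(ll - np.log1p(np.exp(-beta * x_new @ ws.T)))) / Zt
# p_plus estimates p(y_{t+1} = +1 | x_{t+1}; xi_t) = Z_{t+1}/Z_t
```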
We can evaluate these quantities by statistical techniques (see the Appendix).

Theorem 3. The learning curves for the Bayesian predictive distribution are the same as those for maximum likelihood estimation.

We can perform similar calculations in the case of the Gibbs algorithm.

Theorem 4. The learning curves for the Gibbs algorithm are, for the training error,

⟨e(t)⟩_train = H_0   (5.5)

and for the generalization error,

⟨e(t)⟩_gen = H_0 + m/t   (5.6)
Conclusions

We have presented a statistical theory of learning curves. The characteristics of learning curves for stochastic machines can easily be analyzed by the ordinary asymptotic method of statistics. We have shown a universal 1/t convergence rule for faithful and unfaithful statistical models. The difference between the training error and the generalization error has also been given in detail. These results are in terms of the entropic loss, which fits very well with the maximum likelihood estimator. The present theory is closely related to the AIC approach (Akaike 1974; Murata et al. 1991; Moody 1992) and the MDL approach (Rissanen 1986). Our statistical method cannot be applied to deterministic machines, because the statistical model is nonregular in this case, where the Fisher information diverges to infinity. However, an analogous result can be proved for the entropic loss without using the annealed approximation (Amari 1992). But this does not hold for the expected error u_t.

Appendix: Mathematical Proofs

In order to prove Theorem 1, we use the following fundamental lemma in statistics.

Lemma. The maximum likelihood estimator ŵ_t based on t observations ξ_t is asymptotically normally distributed with mean w_0 and covariance matrix (tG)⁻¹,

ŵ_t ~ N(w_0, (1/t) G⁻¹)   (A1)
where w_0 is the true parameter and G = (g_ij) is the Fisher information matrix defined by

g_ij = E[{∂l(y | x, w_0)/∂w_i}{∂l(y | x, w_0)/∂w_j}]

where E denotes the expectation with respect to the distribution p(x)p(y | x, w_0).
When the probability distribution is of the form (2.1), the Fisher information matrix can be calculated explicitly (see Amari 1991). This shows that G diverges to ∞ as the temperature 1/β tends to 0, the estimator ŵ_t becoming more and more accurate.
Proof of Theorem 1. In order to calculate

⟨e(t)⟩_gen = −E[log p(y | x, ŵ_t)]

we expand l(y | x, ŵ_t) = log p(y | x, ŵ_t) at w_0, giving

l(y | x, ŵ_t) = l(y | x, w_0) + ∇l(y | x, w_0)(ŵ_t − w_0) + (1/2)(ŵ_t − w_0)^T ∇∇l(y | x, w_0)(ŵ_t − w_0) + ⋯   (A3)
where ∇l is the gradient with respect to w, ∇∇l = (∂²l/∂w_i ∂w_j) is the Hessian matrix, and the superscript T denotes the transposition of a column vector. By taking the expectation with respect to the new input-output pair (y, x), we have

E[l(y | x, ŵ_t)] = −H_0 − (1/2)(ŵ_t − w_0)^T G (ŵ_t − w_0) + ⋯

because of the identity

−E[∇∇l(y | x, w_0)] = E[(∇l)(∇l)^T]

Taking the expectation with respect to ŵ_t, we have

E[ŵ_t − w_0] = O(1/t)
E[(ŵ_t − w_0)(ŵ_t − w_0)^T] = (1/t) G⁻¹ + O(1/t²)

and hence

E[(ŵ_t − w_0)^T G (ŵ_t − w_0)] = m/t + O(1/t²)
For the training error, we expand l(y_i | x_i, ŵ_t) at w_0 in the same way; substituting this in (A9) and then summing over i, we have

∑_{i=1}^t l(y_i | x_i, ŵ_t) = ∑_{i=1}^t l(y_i | x_i, w_0) + (1/2)(ŵ_t − w_0)^T {−∑_{i=1}^t ∇∇l(y_i | x_i, w_0)} (ŵ_t − w_0) + ⋯

because the maximum likelihood estimator ŵ_t satisfies

∑_{i=1}^t ∇l(y_i | x_i, ŵ_t) = 0

Since the x_i are independently generated, by the law of large numbers we have

(1/t) ∑_{i=1}^t l(y_i | x_i, w_0) ≈ −H_0

(1/t) ∑_{i=1}^t ∇∇l(y_i | x_i, w_0) ≈ E[∇∇l(y | x, w_0)] = −G
Since √t (ŵ_t − w_0) is normally distributed with mean 0 and covariance matrix G⁻¹, the quadratic form

t (ŵ_t − w_0)^T G (ŵ_t − w_0)

can be expressed as a sum of squares of m independent normal random variables with mean 0 and variance 1, implying that it is subject to the χ²-distribution of degree m. Here χ²_m is a random variable subject to the χ²-distribution of degree m. Since its expectation is m, the results (3.5) and (3.6) follow. This proves Theorem 1.

In order to prove Theorem 2, we use the following lemma.

Lemma. The maximum likelihood estimator ŵ_t under an unfaithful model is asymptotically normally distributed with mean w_0* and covariance matrix t⁻¹ K*⁻¹ G* K*⁻¹,

ŵ_t ~ N(w_0*, t⁻¹ K*⁻¹ G* K*⁻¹)
We do not give the proof of the lemma, because it is too technical; refer to Murata et al. (1991). The proof of the theorem is almost parallel to the faithful case, if we replace w_0 by w_0* and take into account that K* ≠ G*. The Bayesian case can be proved by using the relation

p(w | ξ_t) ≈ q(w) t^{m/2} |G|^{1/2} exp{−(t/2)(w − ŵ_t)^T G (w − ŵ_t)}
However, the proof is much more complicated and we omit it. One can complete it by using asymptotic statistical techniques.

Acknowledgments

The authors would like to thank Dr. K. Judd for comments on the manuscript. The present research is supported by the Japanese Ministry of Education, Science and Culture under a Grant-in-Aid on the Special Priority Area of Higher Order Brain Functioning.

References

Akaike, H. 1974. A new look at the statistical model identification. IEEE Trans. AC-19, 716-723.
Amari, S. 1967. Theory of adaptive pattern classifiers. IEEE Trans. EC-16(3), 299-307.
Amari, S. 1985. Differential-Geometrical Methods in Statistics. Springer Lecture Notes in Statistics 28. Springer, New York.
Amari, S. 1991. Dualistic geometry of the manifold of higher-order neurons. Neural Networks 4, 443-445.
Amari, S. 1992. Universal property of learning curves. METR 92-03, Univ. of Tokyo.
Amari, S., Fujita, N., and Shinomoto, S. 1992. Four types of learning curves. Neural Comp. 4(4), 605-618.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Gyorgyi, G., and Tishby, N. 1990. Statistical theory of learning a rule. In Neural Networks and Spin Glasses, K. Thuemann and R. Koeberle, eds., pp. 3-36. World Scientific, Singapore.
Haussler, D., Kearns, M., and Shapire, R. 1991. Bounds on the sample complexity and the VC dimension. Proc. 4th Ann. Workshop on Computational Learning Theory, pp. 61-73. Morgan Kaufmann, San Mateo, CA.
Haussler, D., Littlestone, N., and Warmuth, K. 1988. Predicting (0,1) functions on randomly drawn points. Proc. COLT'88, pp. 280-295. Morgan Kaufmann, San Mateo, CA.
Hansel, D., and Sompolinsky, H. 1990. Learning from examples in a single-layer neural network. Europhys. Lett. 11, 687-692.
Heskes, T. M., and Kappen, B. 1991. Learning processes in neural networks. Phys. Rev. A 44, 2718-2726.
Levin, E., Tishby, N., and Solla, S. A. 1990. A statistical approach to learning and generalization in layered neural networks. Proc. IEEE 78(10), 1568-1574.
Moody, J. E. 1992. The effective number of parameters: An analysis of generalization and regularization in nonlinear systems. In Advances in Neural Information Processing Systems, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds. Morgan Kaufmann, San Mateo, CA.
Murata, N., Yoshizawa, S., and Amari, S. 1991. A criterion for determining the number of parameters in an artificial neural network model. In Artificial Neural Networks, T. Kohonen, K. Makisara, O. Simula, and J. Kangas, eds. Elsevier Science Publishers B.V., North-Holland.
Opper, M., and Haussler, D. 1991. Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. Proc. 4th Ann. Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.
Rissanen, J. 1986. Stochastic complexity and modeling. Ann. Statist. 14, 1080-1100.
Rosenblatt, F. 1961. Principles of Neurodynamics. Spartan, New York.
Rumelhart, D., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations. MIT Press, Cambridge, MA.
Seung, S., Sompolinsky, H., and Tishby, N. 1991. Learning from examples in large neural networks. To be published.
Valiant, L. G. 1984. A theory of the learnable. Comm. ACM 27(11), 1134-1142.
White, H. 1989. Learning in artificial neural networks: A statistical perspective. Neural Comp. 1, 425-464.
Widrow, B. 1966. A Statistical Theory of Adaptation. Pergamon Press, Oxford.
Yamanishi, K. 1990. A learning criterion for stochastic rules. Proc. 3rd Ann. Workshop on Computational Learning Theory, pp. 67-81. Morgan Kaufmann, San Mateo, CA.
Yamanishi, K. 1991. A loss bound model for on-line stochastic prediction strategies. Proc. 4th Ann. Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.

Received 14 November 1991; accepted 18 August 1992.
Communicated by Haim Sompolinsky
Learning in the Recurrent Random Neural Network

Erol Gelenbe
École des Hautes Études en Informatique, Université René Descartes (Paris V), 45 rue des Saints-Pères, 75006 Paris, France

The capacity to learn from examples is one of the most desirable features of neural network models. We present a learning algorithm for the recurrent random network model (Gelenbe 1989, 1990) using gradient descent of a quadratic error function. The analytical properties of the model lead to a "backpropagation"-type algorithm that requires the solution of a system of n linear and n nonlinear equations each time the n-neuron network "learns" a new input-output pair.

1 Introduction
The capability to learn from examples is one of the most desirable features of neural network models. Therefore this issue has been at the center of much research in neural network theory and applications (Ackley et al. 1985; Le Cun 1985; Rumelhart et al. 1986). Learning theory in general is of major interest because of its numerous implications in machine intelligence, as well as its ability to provide a better understanding of the relationship between natural and artificial intelligence. In the area of artificial neural networks, learning has been extensively studied in the context of feedforward networks, primarily on the basis of the backpropagation algorithm (Rumelhart et al. 1986). Designing effective learning algorithms for general (i.e., recurrent) networks is a current and legitimate scientific concern in neural network theory. There are numerous examples where recurrent networks constitute a natural approach to problems. Such examples include, in particular, image processing and pattern analysis and recognition (see, for instance, Atalay et al. 1991), where local interactions between picture elements lead to mutual interactions between neighboring neurons, which are naturally represented by recurrent networks. In such cases, it is clear that effective learning algorithms for recurrent networks can enhance the value of neural network methodology. Another area where recurrent networks are indispensable is combinatorial optimization, and it would be interesting to explore further the relationship between the application

Neural Computation 5, 154-164 (1993) © 1993 Massachusetts Institute of Technology
of neural networks to control and optimization (Gelenbe and Batty 1992) and network learning. Several authors have considered learning algorithms for recurrent connectionist networks (Almeida 1987; Pineda 1987, 1989; Pearlmutter 1989; Behrens et al. 1991). These are based on neural network dynamics that exhibit a fixed-point behavior. The work presented in this paper extends this approach to the random network model (Gelenbe 1989, 1990), which has the advantage of possessing well-defined fixed-point equations representing the stationary solution of the stochastic network equations. Applications of the random network model to image texture generation, associative memory, pattern recognition, and combinatorial optimization have been described elsewhere (Atalay et al. 1991; Gelenbe et al. 1991; Mokhtari 1991; Gelenbe and Batty 1992). In this paper we present a "backpropagation"-type learning algorithm for the recurrent random network model (Gelenbe 1989, 1990), using gradient descent of a quadratic error function when a set of input-output pairs is presented to the network. Both the excitation and inhibition weights of the random network model must be learned by the algorithm. Thus, it requires the solution of a system of 2n linear and n nonlinear equations each time the n-neuron network "learns" a new input-output pair. The system of nonlinear equations describes the network's fixed point, while the linear equations are obtained from the partial derivatives of these equations with respect to the network weights. To justify the use of the algorithm, we prove (in the Appendix) a general theorem concerning necessary and sufficient conditions for the existence of the stationary or fixed-point solution to the network. This general result completes the work presented in Gelenbe (1990), where only more restrictive sufficient conditions were given.

Note that for our network existence implies uniqueness of the solution, due to the fact that the random network model is characterized by Chapman-Kolmogorov equations. Furthermore, existence implies stability, since all moments of the state distribution can be explicitly computed from the model's product-form property.

2 The Random Network Model
In the random network model (RN), n neurons exchange positive and negative impulse signals. Each neuron accumulates signals as they arrive, and fires if its total signal count at a given instant of time is positive. Firing then occurs at random according to an exponential distribution of constant rate, and signals are sent out to other neurons or to the outside of the network. Each neuron i of the network is represented at time t by its input signal potential k_i(t), constituted only by the positive signals that have accumulated, which have not yet been cancelled by negative signals, and which have not yet been sent out by the neuron as it fires. Positive signals represent excitation, while negative signals represent inhibition. A negative signal reduces by 1 the potential of the neuron at which it arrives (i.e., it "cancels" an existing signal), or has no effect on the signal potential if it is already zero, while an arriving positive signal adds 1 to the neuron potential. This is a simplified representation of biophysical neural behavior (Kandel and Schwartz 1985). In the RN, signals arrive at a neuron from the outside of the network (exogenous signals) or from other neurons. Each time a neuron fires, a signal leaves it, depleting its total input potential. A signal leaving neuron i heads for neuron j with probability p^+(i,j) as a positive (or normal) signal, or as a negative signal with probability p^-(i,j), or it departs from the network with probability d(i). p(i,j) = p^+(i,j) + p^-(i,j) is the transition probability of a Markov chain representing the movement of signals between neurons. We have

Σ_j p(i,j) + d(i) = 1    for 1 ≤ i ≤ n

External (or exogenous) inputs to each neuron i of the network are provided by stationary Poisson processes of rates Λ(i) and λ(i). A neuron is capable of firing and emitting signals if its potential is strictly positive, and firing times are modeled by iid exponential neuron firing times with rate r(i) at neuron i. In Gelenbe (1989) it was shown that this network has a product form solution. That is, the network's stationary probability distribution can be written as the product of the marginal probabilities of the state of each neuron. This does not imply that the neurons behave independently of each other. Indeed, the probabilities that each neuron is excited are obtained from the coupled nonlinear signal flow equations 2.1 and 2.2 below, which yield the rate of signal arrival and hence the rate of firing of each neuron in steady state. The RN has a number of interesting features:
1. It represents more closely the manner in which signals are transmitted in a biophysical neural network, where they travel as spikes rather than as fixed analog signals.

2. It is computationally efficient.

3. It is easy to simulate, since each neuron is simply represented by a counter; this may lead to a simple hardware implementation.

4. It represents neuron potential, and therefore the level of excitation, as an integer rather than as a binary variable, which leads to more detailed information on system state; a neuron is interpreted as being in the "firing state" if its potential is positive.

Let k(t) = [k_1(t), ..., k_n(t)] be the vector of signal potentials at time t, and k = (k_1, ..., k_n) be a particular value of the vector. p(k) denotes the stationary probability distribution p(k) = lim_{t→∞} Prob[k(t) = k], if it exists. Since {k(t) : t ≥ 0} is a continuous time Markov chain, it satisfies the usual Chapman-Kolmogorov equations; thus in steady state p(k) must satisfy the global balance equations:

p(k) Σ_i [Λ(i) + (λ(i) + r(i)) 1[k_i > 0]]
    = Σ_i [ p(k_i^+) r(i) d(i) + p(k_i^-) Λ(i) 1[k_i > 0] + p(k_i^+) λ(i)
        + Σ_j { p(k_ij^{+-}) r(i) p^+(i,j) 1[k_j > 0]
              + p(k_ij^{++}) r(i) p^-(i,j)
              + p(k_i^+) r(i) p^-(i,j) 1[k_j = 0] } ]

where the vectors used are

k_i^+     = (k_1, ..., k_i + 1, ..., k_n)
k_i^-     = (k_1, ..., k_i - 1, ..., k_n)
k_ij^{+-} = (k_1, ..., k_i + 1, ..., k_j - 1, ..., k_n)
k_ij^{++} = (k_1, ..., k_i + 1, ..., k_j + 1, ..., k_n)

and 1[X] is the usual characteristic function, which takes the value 1 if X is true and 0 otherwise.
Theorem (Gelenbe 1989). Let

q_i = λ^+(i)/[r(i) + λ^-(i)]                                        (2.1)

where the λ^+(i), λ^-(i) for i = 1, ..., n satisfy the following system of nonlinear simultaneous equations:

λ^+(i) = Σ_j q_j r(j) p^+(j,i) + Λ(i),
λ^-(i) = Σ_j q_j r(j) p^-(j,i) + λ(i)                               (2.2)

If a unique nonnegative solution {λ^+(i), λ^-(i)} exists to equations 2.1 and 2.2 such that each q_i < 1, then

p(k) = Π_{i=1}^n [1 - q_i] q_i^{k_i}
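For a concrete small network, equations 2.1 and 2.2 can be solved by plain fixed-point iteration. The following sketch is not from the paper; the routing probabilities and rates are made up for illustration:

```python
import numpy as np

def steady_state_q(P_plus, P_minus, r, Lambda, lam, n_iter=200):
    """Fixed-point iteration of equations 2.1-2.2 (illustrative sketch).

    P_plus[j, i] = p+(j, i), P_minus[j, i] = p-(j, i), r = firing rates,
    Lambda / lam = exogenous positive / negative signal rates.
    """
    q = np.zeros(len(r))
    for _ in range(n_iter):
        lam_plus = (q * r) @ P_plus + Lambda    # eq. 2.2, positive signals
        lam_minus = (q * r) @ P_minus + lam     # eq. 2.2, negative signals
        q = np.minimum(lam_plus / (r + lam_minus), 1.0)   # eq. 2.1, clipped if saturated
    return q

# Toy 3-neuron ring: each neuron sends 60% excitatory and 30% inhibitory
# signals to its successor, and loses 10% to the outside (d(i) = 0.1).
ring = np.roll(np.eye(3), 1, axis=1)
q = steady_state_q(0.6 * ring, 0.3 * ring, r=np.ones(3),
                   Lambda=np.array([0.8, 0.1, 0.1]), lam=np.full(3, 0.05))
```

Since all q_i come out below 1 here, the product form of the Theorem applies and the network is stable.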
As a consequence of this result, whenever the q_i < 1 can be found, the network is stable in the sense that all moments (marginal or joint) of the neural network state can be found from the above formula, and all moments are finite. For instance, the average potential at a neuron i is simply q_i/[1 - q_i]. The rate (frequency) of the emission of spikes from neuron i in steady state is then q_i r(i). Furthermore, because the underlying model is described by Chapman-Kolmogorov equations, whenever there is a solution, it is necessarily unique and given by the above product form formula.

If for some neuron we have λ^+(i) > [r(i) + λ^-(i)], we say that the neuron is unstable or saturated. This implies that it is constantly excited in steady state: lim_{t→∞} Prob[k_i(t) > 0] = 1. Its rate of spike emission is then r(i): to another neuron j of the network its output appears as a constant source of positive or negative signals of rates r(i)p^+(i,j) and r(i)p^-(i,j).

For notational convenience let us write

w^+(j,i) = r(j)p^+(j,i) ≥ 0,    w^-(j,i) = r(j)p^-(j,i) ≥ 0,
N(i) = Σ_j q_j w^+(j,i) + Λ(i),
D(i) = r(i) + Σ_j q_j w^-(j,i) + λ(i)

Then 2.1 becomes

q_i = N(i)/D(i)                                                     (2.3)

and r(i) = Σ_j [w^+(i,j) + w^-(i,j)].

2.1 The Role of the Parameters w^+(j,i) and w^-(j,i). The weight parameters w^+(j,i) and w^-(j,i) have a somewhat different effect in the RN model than the weights w(j,i) in the connectionist model. In the RN model, all the w^+(j,i) and w^-(j,i) are nonnegative, since they represent rates at which positive and negative signals are sent out from neuron j to neuron i. Furthermore, in the RN model, for a given pair (i,j) it is possible that both w^+(i,j) > 0 and w^-(i,j) > 0; in general, it is not possible to transform an RN into an equivalent network in which certain connections are only excitatory while others are only inhibitory, as would be the case in the usual connectionist model. Therefore, in the RN, for each pair (i,j) it will be necessary to learn both w^+(i,j) and w^-(i,j).

3 Learning with the Recurrent Random Network Model
We now present an algorithm for choosing the set of network parameters W in order to learn a given set of K input-output pairs (L, Y), where the set of successive inputs is denoted L = {L_1, ..., L_K}, and each L_k = (Λ_k, λ_k) is a pair of vectors of positive and negative signal flow rates entering each neuron:

Λ_k = [Λ_k(1), ..., Λ_k(n)],    λ_k = [λ_k(1), ..., λ_k(n)]

The successive desired outputs are the vectors Y = {y_1, ..., y_K}, where each vector y_k = (y_1k, ..., y_nk), whose elements y_ik ∈ [0,1] correspond to the desired values of each neuron. The network approximates the set of desired output vectors in a manner that minimizes a cost function E_k:

E_k = (1/2) Σ_{i=1}^n a_i (q_i - y_ik)^2,    a_i ≥ 0
Without loss of generality, we treat each of the n neurons of the network as an output neuron; if we wish to remove some neuron j from the network output, it suffices to set a_j = 0 in the cost function and to disregard q_j when constructing the output of the network. Our algorithm lets the network learn both n by n weight matrices W_k^+ = {w_k^+(i,j)} and W_k^- = {w_k^-(i,j)} by computing, for each input L_k = (Λ_k, λ_k), a new value W_k^+ and W_k^- of the weight matrices, using gradient descent. Clearly, we seek only solutions for which all these weights are positive. Let us denote by the generic term w(u,v) either w(u,v) = w^-(u,v) or w(u,v) = w^+(u,v). The rule for weight update may be written as

w_k(u,v) = w_{k-1}(u,v) - η Σ_{i=1}^n a_i (q_ik - y_ik) [∂q_i/∂w(u,v)]_k        (3.1)

where η > 0 is some constant, and

1. q_ik is calculated using the input L_k and w(u,v) = w_{k-1}(u,v) in equation 2.3.

2. [∂q_i/∂w(u,v)]_k is evaluated at the values q_i = q_ik and w(u,v) = w_{k-1}(u,v).

To compute [∂q_i/∂w(u,v)]_k we turn to expression 2.3, from which we derive the following equation:

∂q_i/∂w(u,v) = Σ_j [∂q_j/∂w(u,v)] [w^+(j,i) - w^-(j,i)q_i]/D(i)
               - 1[u = i] q_i/D(i)
               + 1[w(u,v) = w^+(u,i)] q_u/D(i)
               - 1[w(u,v) = w^-(u,i)] q_u q_i/D(i)

Let q = (q_1, ..., q_n), and define the n x n matrix

W = {[w^+(i,j) - w^-(i,j)q_j]/D(j)},    i,j = 1, ..., n

We can now write the vector equations:

∂q/∂w^+(u,v) = [∂q/∂w^+(u,v)] W + γ^+(u,v) q_u
∂q/∂w^-(u,v) = [∂q/∂w^-(u,v)] W + γ^-(u,v) q_u

where the elements of the n-vectors γ^+(u,v) = [γ_1^+(u,v), ..., γ_n^+(u,v)] and γ^-(u,v) = [γ_1^-(u,v), ..., γ_n^-(u,v)] are

γ_i^+(u,v) = -1/D(i)           if u = i, v ≠ i,
             +1/D(i)           if u ≠ i, v = i,
             0                 for all other values of (u,v);

γ_i^-(u,v) = -(1 + q_i)/D(i)   if u = i, v = i,
             -1/D(i)           if u = i, v ≠ i,
             -q_i/D(i)         if u ≠ i, v = i,
             0                 for all other values of (u,v)
Notice that

∂q/∂w^+(u,v) = γ^+(u,v) q_u [I - W]^{-1}
∂q/∂w^-(u,v) = γ^-(u,v) q_u [I - W]^{-1}                            (3.2)

where I denotes the n by n identity matrix. Hence the main computational effort in solving 3.2 is simply to obtain [I - W]^{-1}, which can be done in time complexity O(n^3), or O(mn^2) if an m-step relaxation method is used. Since the solution of 2.3 is necessary for the learning algorithm, in the Appendix we derive necessary and sufficient conditions for the existence of the q_i. We now have the information to specify the complete learning algorithm for the network.

Initiate the matrices W_0^+ and W_0^- in some appropriate manner. This initiation will be made at random (among nonnegative matrices) if no better information is available; in some cases it may be possible to choose these initial values by using a Hebbian learning rule. Choose a value of η. Then:

1. For each successive value of k, starting with k = 1, proceed as follows. Set the input values to L_k = (Λ_k, λ_k).

2. Solve the system of nonlinear equations 2.3 with these values.

3. Solve the system of linear equations 3.2 with the results of (2).

4. Using equation 3.1 and the results of (2) and (3), update the matrices W_k^+ and W_k^-. Since we seek the "best" matrices (in terms of gradient descent of the quadratic cost function) that satisfy the nonnegativity constraint, in any step k of the algorithm, if the iteration yields a negative value of a term, we have two alternatives:

   a. set the term to zero, and stop the iteration for this term in this step k; in the next step k + 1 we will iterate on this term with the same rule, starting from its current null value;

   b. go back to the previous value of the term and iterate with a smaller value of η.

In our implementation we have used (a). Note that we may either proceed with a complete gradient descent [iterating on Steps (2), (3), and (4) until the change in the cost function or in the new values of the weights is smaller than some predetermined value], or carry out only one iteration for all the weights for each successive value of k (new input).
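The steps above can be sketched in code. This is an illustrative implementation, not the authors' own: the network size, rates, targets, and step size η are made up, the steady-state q is obtained by fixed-point iteration, the derivatives come from equation 3.2 via [I - W]^{-1}, and negative weights are clipped to zero, i.e., alternative (a) of Step 4:

```python
import numpy as np

def solve_q(Wp, Wm, Lambda, lam, n_iter=300):
    """Steady-state q from the nonlinear flow equations (eq. 2.3)."""
    r = Wp.sum(axis=1) + Wm.sum(axis=1)      # r(i) = sum_j [w+(i,j) + w-(i,j)]
    q = np.zeros(len(r))
    for _ in range(n_iter):
        q = np.minimum((q @ Wp + Lambda) / (r + q @ Wm + lam), 1.0)
    return q, r

def learning_step(Wp, Wm, Lambda, lam, y, a, eta=0.05):
    """One gradient update of W+ and W- by equations 3.1 and 3.2 (sketch).

    Wp[u, v] = w+(u, v), Wm[u, v] = w-(u, v); all quantities illustrative.
    """
    n = len(Lambda)
    q, r = solve_q(Wp, Wm, Lambda, lam)                  # Step 2
    D = r + q @ Wm + lam
    # Step 3: matrix W of the vector equations and the resolvent of eq. 3.2.
    W = (Wp - Wm * q[None, :]) / D[None, :]              # W[i,j] = [w+(i,j)-w-(i,j)q_j]/D(j)
    R = np.linalg.inv(np.eye(n) - W)                     # [I - W]^{-1}
    err = a * (q - y)                                    # a_i (q_i - y_i)
    new_Wp, new_Wm = Wp.copy(), Wm.copy()
    for u in range(n):
        for v in range(n):
            gp, gm = np.zeros(n), np.zeros(n)            # gamma+(u,v), gamma-(u,v)
            if u == v:
                gm[u] = -(1.0 + q[u]) / D[u]
            else:
                gp[u], gp[v] = -1.0 / D[u], 1.0 / D[v]
                gm[u], gm[v] = -1.0 / D[u], -q[v] / D[v]
            dq_dwp = (gp * q[u]) @ R                     # eq. 3.2
            dq_dwm = (gm * q[u]) @ R
            new_Wp[u, v] -= eta * (err @ dq_dwp)         # eq. 3.1
            new_Wm[u, v] -= eta * (err @ dq_dwm)
    # Step 4, alternative (a): clip negative weights to zero.
    return np.maximum(new_Wp, 0.0), np.maximum(new_Wm, 0.0)

# One pass over a toy 4-neuron network with made-up rates and targets:
rng = np.random.default_rng(1)
Wp0 = rng.uniform(0.1, 0.5, (4, 4))
Wm0 = rng.uniform(0.1, 0.5, (4, 4))
Lambda = np.full(4, 0.4)
lam = np.full(4, 0.1)
y = np.array([0.2, 0.6, 0.3, 0.5])
Wp1, Wm1 = learning_step(Wp0, Wm0, Lambda, lam, y, a=np.ones(4))
```

Repeated calls to `learning_step` with the same input-output pair correspond to the "complete gradient descent" variant; cycling through k = 1, ..., K gives the one-iteration-per-input variant.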
Clearly, one may either update the weight matrices separately for each successive value of k (i.e., successive input) as suggested, or sum the updates for all inputs at each iteration of the algorithm.

3.1 Complexity. Several authors have examined the complexity of neural network learning algorithms (Pineda 1989; Baum 1991). One viewpoint (Pineda 1989) is to consider the complexity of each network weight update, while another is to consider the complexity of learning a given family of input-output functions (Baum 1991). In the latter approach, it is known that learning even elementary boolean functions using the backpropagation algorithm is NP-complete. In fact, the complexity of our learning algorithm is of the same order as that of the algorithms described in Pineda (1989). Here we merely discuss the complexity of weight update for the algorithm we have presented. Notice that the algorithm requires that for each (u,v) and for each input (successive k) we solve the nonlinear system of equations 2.3 and the linear system 3.2. Equations 3.2 have to be solved for each (u,v). [I - W]^{-1} is obtained in time complexity O(n^3), or in time complexity O(mn^2) if a relaxation method with m iterations is adopted, as suggested, for instance, in Pineda (1989). The remaining computations for 3.2 are trivial. Similarly for 2.3, which is a nonlinear system of equations (to be solved once for each step), the complexity will be O(mn^2).
Appendix: Existence and Uniqueness of Network Solutions

As with most neural network models, the signal flow equations 2.1 and 2.2, which describe the manner in which each neuron receives inhibitory or excitatory signals from other neurons or from the outside world, are nonlinear. These equations are essential to the construction of the learning algorithm described above. Yet only sufficient conditions for the existence (and uniqueness) of their solution had previously been established, for feedforward networks or for so-called hyperstable networks (Gelenbe 1989, 1990). Thus, in order to implement the learning algorithm, it is useful to have necessary and sufficient conditions for their existence. This is precisely what we do in this appendix. Rewrite 2.1 and 2.2 as follows:

λ^+(i) = Σ_j λ^+(j) p^+(j,i) r(j)/[r(j) + λ^-(j)] + Λ(i),
λ^-(i) = Σ_j λ^+(j) p^-(j,i) r(j)/[r(j) + λ^-(j)] + λ(i)            (A.1)

where the q_i have disappeared from the equations. The λ^+(i) and λ^-(i) represent the total arrival rates of positive and negative signals at neuron i.
Define the following vectors:

λ^+ with elements λ^+(i),    Λ with elements Λ(i)
λ^- with elements λ^-(i),    λ with elements λ(i)

Let F be the diagonal matrix with elements f_i = r(i)/[r(i) + λ^-(i)] ≤ 1. Then A.1 may be written as

λ^+ = λ^+ F P^+ + Λ,    or    λ^+ (I - F P^+) = Λ                   (A.2)

λ^- = λ^+ F P^- + λ                                                 (A.3)

Proposition 1. Equations A.2 and A.3 have a solution (λ^+, λ^-).

Proof. Since the series Σ_{n=0}^∞ (F P^+)^n is geometrically convergent (Kemeny and Snell 1960, p. 43 ff.), we can write A.2 as

λ^+ = Λ Σ_{n=0}^∞ (F P^+)^n

so that A.3 becomes

λ^- = Λ [Σ_{n=0}^∞ (F P^+)^n] F P^- + λ                             (A.4)

Now define y = λ^- - λ, and call G the vector function

G(y) = Λ [Σ_{n=0}^∞ (F P^+)^n] F P^-                                (A.5)

where the dependence of G on y comes from F, which depends on λ^-. Notice that G is continuous. Therefore, by Brouwer's fixed-point theorem, the equation y = G(y)
has a fixed point y*. This fixed point will in turn yield the solution of A.2 and A.3:

λ^+ = Λ Σ_{n=0}^∞ (F* P^+)^n,    λ^- = y* + λ

where F* denotes F evaluated at λ^- = y* + λ, completing the proof. □

If this computation yields a fixed point y* such that for some neuron i we have λ^+(i) ≥ [r(i) + λ^-(i)], then the stationary solution for neuron i does not exist; this simply means that in steady state neuron i is constantly excited, and we set q_i(y*) = 1. If on the other hand we obtain λ^+(i) < [r(i) + λ^-(i)], then we set q_i(y*) = λ^+(i)/[r(i) + λ^-(i)]. Since p(k) is a probability distribution it must sum to 1, which is the case if q_i(y*) < 1 for all i, and hence p(k) exists. Let us insist on the fact that p(k) is indeed unique, and that q_i(y*) < 1 for all i implies stability (in the sense of finiteness of all moments of the state).
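The construction in the proof can be turned directly into a numerical procedure. The sketch below (routing probabilities and rates are illustrative, not from the paper) iterates y = G(y), recovering λ^+ from A.2 by a linear solve at each step:

```python
import numpy as np

def appendix_fixed_point(P_plus, P_minus, r, Lambda, lam, n_iter=200):
    """Iterate y = G(y) (eq. A.5) to solve A.2-A.3 for (lambda+, lambda-)."""
    n = len(r)
    y = np.zeros(n)                                  # y = lambda^- - lam
    for _ in range(n_iter):
        F = np.diag(r / (r + y + lam))               # f_i = r(i)/[r(i) + lambda^-(i)]
        lp = Lambda @ np.linalg.inv(np.eye(n) - F @ P_plus)   # lambda^+ from A.2
        y = lp @ F @ P_minus                         # G(y); lambda^- = y + lam (A.3)
    lm = y + lam
    q = np.minimum(lp / (r + lm), 1.0)               # q_i(y*), set to 1 if saturated
    return lp, lm, q

# Toy 3-neuron ring: p+(i, i+1) = 0.6, p-(i, i+1) = 0.3, d(i) = 0.1.
ring = np.roll(np.eye(3), 1, axis=1)
lp, lm, q = appendix_fixed_point(0.6 * ring, 0.3 * ring, r=np.ones(3),
                                 Lambda=np.array([0.8, 0.1, 0.1]),
                                 lam=np.full(3, 0.05))
```

The final check q_i(y*) < 1 for all i is exactly the condition of the Remark below for the existence of p(k).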
Remark. This reduces the problem of determining existence and uniqueness of the steady-state distribution of a random network to that of computing y* from A.5, which always exists by Proposition 1, and then of verifying the condition q_i(y*) < 1 for each i = 1, ..., n.

Acknowledgments

The author acknowledges the support of Pôle Algorithmique Répartie, C3 CNRS, the French National Program in Distributed Computing, and of a grant from the Ministère de la Recherche et de la Technologie (Paris, France).
References

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. 1985. A learning algorithm for Boltzmann machines. Cog. Sci. 9, 147-169.

Almeida, L. B. 1987. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. Proc. IEEE First International Conf. Neural Networks, San Diego, CA, Vol. II, pp. 609-618.

Atalay, V., Gelenbe, E., and Yalabik, N. 1991. Texture generation with the random neural network model. In Artificial Neural Networks, Vol. I, T. Kohonen, ed., pp. 111-117. North-Holland, Amsterdam.

Baum, E. B. 1991. Neural net algorithms that learn in polynomial time from examples and queries. Draft paper, May 11 (private communication).

Behrens, H., Gawronska, D., Hollatz, J., and Schürmann, B. 1991. Recurrent and feedforward backpropagation: Performance studies. In Artificial Neural Networks, Vol. II, T. Kohonen et al., eds., pp. 1511-1514. North-Holland, Amsterdam.

Gelenbe, E. 1989. Random neural networks with negative and positive signals and product form solution. Neural Comp. 1(4), 502-510.

Gelenbe, E. 1990. Stability of the random neural network model. Neural Comp. 2(2), 239-247.

Gelenbe, E., and Batty, F. 1992. Minimum cost graph covering with the random network model. ORSA TC on Computer Science Conference, Williamsburg, VA, January. Pergamon Press, Oxford.

Gelenbe, E., Stafilopatis, A., and Likas, A. 1991. In Artificial Neural Networks, Vol. I, T. Kohonen, ed., pp. 307-315. North-Holland, Amsterdam.

Kandel, E. R., and Schwartz, J. H. 1985. Principles of Neural Science. Elsevier, Amsterdam.

Kemeny, J. G., and Snell, J. L. 1960. Finite Markov Chains. Van Nostrand, Princeton, NJ.

Le Cun, Y. 1985. A learning procedure for asymmetric threshold networks. Proc. Cognitiva 85, 599-604.

Mokhtari, M. 1992. Recognition of typed images with the random network model. Int. J. Pattern Recognition Artificial Intelligence, in press.

Pearlmutter, B. A. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1(2), 263-269.

Pineda, F. J. 1987. Generalization of backpropagation to recurrent and higher order neural networks. In Neural Information Processing Systems, D. Z. Anderson, ed., p. 602. American Institute of Physics.

Pineda, F. J. 1989. Recurrent backpropagation and the dynamical approach to adaptive neural computation. Neural Comp. 1(2), 161-172.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., Vols. I and II. Bradford Books and MIT Press, Cambridge, MA.

Received 3 September 1991; accepted 27 May 1992.
REVIEW
Communicated by Steven Nowlan
Neural Networks and Nonlinear Adaptive Filtering: Unifying Concepts and New Algorithms

O. Nerrand
P. Roussel-Ragot
L. Personnaz
G. Dreyfus
École Supérieure de Physique et de Chimie Industrielles de la Ville de Paris, 10, rue Vauquelin, 75005 Paris, France

S. Marcos
Laboratoire des Signaux et Systèmes, École Supérieure d'Électricité, Plateau de Moulon, 91192 Gif-sur-Yvette, France
The paper proposes a general framework that encompasses the training of neural networks and the adaptation of filters. We show that neural networks can be considered as general nonlinear filters that can be trained adaptively, that is, that can undergo continual training with a possibly infinite number of time-ordered examples. We introduce the canonical form of a neural network. This canonical form permits a unified presentation of network architectures and of gradient-based training algorithms for both feedforward networks (transversal filters) and feedback networks (recursive filters). We show that several algorithms used classically in linear adaptive filtering, and some algorithms suggested by other authors for training neural networks, are special cases in a general classification of training algorithms for feedback networks. 1 Introduction
The recent development of neural networks has made comparisons between "neural" approaches and classical ones an absolute necessity, to assess unambiguously the potential benefits of using neural nets to perform specific tasks. These comparisons can be performed either on the basis of simulations, which are necessarily limited in scope to the systems that are simulated, or on a conceptual basis, endeavoring to put into perspective the methods and algorithms related to various approaches. The present paper belongs to the second category. It proposes a general framework that encompasses algorithms used for the training of neural networks and algorithms used for the estimation of the parameters of filters. Specifically, we show that neural networks can be used

Neural Computation 5, 165-199 (1993) © 1993 Massachusetts Institute of Technology
adaptively, that is, can undergo continual training with a possibly infinite number of time-ordered examples, in contradistinction to the traditional training of neural networks with a finite number of examples presented in an arbitrary order; therefore, neural networks can be regarded as a class of nonlinear adaptive filters, either transversal or recursive, which are quite general because of the ability of feedforward nets to approximate nonlinear functions. We further show that algorithms that can be used for the adaptive training of feedback neural networks fall into four broad classes; these classes include, as special instances, the methods that have been proposed in the recent past for training neural networks adaptively, as well as algorithms that have been in current use in linear adaptive filtering. Furthermore, this framework allows us to propose a number of new algorithms that may be used for nonlinear adaptive filtering and for nonlinear adaptive control. The first part of the paper is a short presentation of adaptive filters and neural networks. In the second part, we define the architectures of neural networks for nonlinear filtering, either transversal or recursive; we introduce the concept of the canonical form of a network. The third part is devoted to the adaptive training of neural networks; we first consider transversal filters, whose training is relatively straightforward; we subsequently consider the training of feedback networks for nonlinear recursive adaptive filtering, which is a much richer problem; we introduce undirected, semidirected, and directed algorithms, and put them into the perspective of standard approaches in adaptive filtering (output error and equation error approaches) and adaptive control (parallel and series-parallel approaches), as well as of algorithms suggested earlier for the training of neural networks.
Neural Networks and Adaptive Filtering

2 Scopes of Adaptive Filters and of Neural Networks

2.1 Adaptive Filters. Adaptive filtering is of central importance in many applications of signal processing, such as the modeling, estimation, and detection of signals. Adaptive filters also play a crucial role in system modeling and control. These applications are related to communications, radar, sonar, biomedical electronics, geophysics, etc. A general discrete-time filter defines a relationship between an input time sequence {u(n), u(n-1), ...} and an output time sequence {y(n), y(n-1), ...}, u(n) and y(n) being either uni- or multidimensional signals. In the following, we consider filters having one input and one output; the generalization to multidimensional signals is straightforward. There are two types of filters: (1) transversal filters (termed finite impulse response or FIR filters in linear filtering), whose outputs are functions of the input signals only; and (2) recursive filters (termed infinite impulse response or IIR filters in linear filtering), whose outputs are functions both of the input signals and of a delayed version of the output signals. Hence, a transversal filter is defined by

y(n) = Φ[u(n), u(n-1), ..., u(n-M+1)]    (1)

where M is the length of the finite memory of the filter, and a recursive filter is defined by

y(n) = Φ[u(n), u(n-1), ..., u(n-M+1), y(n-1), y(n-2), ..., y(n-N)]    (2)
where N is the order of the filter. The ability of a filter to perform the desired task is expressed by a criterion; this criterion may be either quantitative, for example, maximizing the signal-to-noise ratio for spatial filtering (see for instance Applebaum and Chapman 1976) or minimizing the bit error rate in data transmission (see for instance Proakis 1983), or qualitative, for example, listening, for speech prediction (see for instance Jayant and Noll 1984). In practice, the criterion is usually expressed as a weighted sum of squared differences between the output of the filter and the desired output (e.g., the LS criterion). An adaptive filter is a system whose parameters are continually updated, without explicit control by the user. The interest in adaptive filters stems from two facts: (1) tailoring a filter of given architecture to perform a specific task requires a priori knowledge of the characteristics of the input signal; since this knowledge may be absent or partial, systems that can learn the characteristics of the signal are desirable; and (2) filtering nonstationary signals necessitates systems that are capable of tracking the variations of the characteristics of the signal. The bulk of adaptive filtering theory is devoted to linear adaptive filters, defined by relations (1) and (2), where Φ is a linear function. Linear filters have been extensively studied, and are appropriate for many purposes in signal processing. A family of particularly efficient adaptation algorithms has been specially designed for transversal linear filtering; they are referred to as the recursive least squares (RLS) algorithms and their fast (FRLS) versions (Bellanger 1987; Haykin 1991). Linear adaptive filters are widely used for system and signal modeling, due to their simplicity, and due to the fact that in many cases (such as the estimation of gaussian signals) they are optimal.
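As a minimal concrete illustration of linear adaptive transversal filtering, the sketch below adapts the coefficients with the classical LMS rule, a simpler gradient-based relative of the RLS algorithms cited above (the function name, step size, and test signal are our own assumptions, not part of the paper):

```python
def lms_identify(u, d, M, mu=0.05):
    """Adapt the M coefficients c of a linear transversal filter
    y(n) = sum_k c[k] * u(n-k) so as to track the desired output d(n),
    by a stochastic gradient step of size mu on e(n)^2."""
    c = [0.0] * M
    window = [0.0] * M                   # [u(n), u(n-1), ..., u(n-M+1)]
    errors = []
    for x, dn in zip(u, d):
        window = [x] + window[:-1]       # shift the input window by one period
        y = sum(ck * uk for ck, uk in zip(c, window))
        e = dn - y                       # e(n) = d(n) - y(n)
        for k in range(M):
            c[k] += mu * e * window[k]   # gradient step on the squared error
        errors.append(e)
    return c, errors
```

With a noise-free desired signal generated by a true linear filter, the coefficients converge to the true values and the error decays toward zero.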
Despite their popularity, they remain inappropriate in many cases, especially for modeling nonlinear systems; investigations along these lines have been performed for adaptive detection (see for instance Picinbono 1988), prediction, and estimation (see for instance McCannon et al. 1982). Unfortunately, when dealing with nonlinear filters, no general adaptation algorithm is available, so that heuristic approaches are used. By contrast, general methods for training neural networks are available; furthermore, neural networks are known to be universal approximants (see for instance Hornik et al. 1989), so that they can be used to approximate any smooth nonlinear function. Since both the adaptation of filters (Haykin 1991; Widrow and Stearns 1985) and the training of
O. Nerrand et al.
neural networks involve gradient techniques, we propose to build on this algorithmic similarity a general framework that encompasses neural networks and filters. We do this in such a way as to clarify how neural networks can be applied to adaptive filtering problems.

2.2 Neural Networks. The reader is assumed to be familiar with the scope and principles of operation of neural networks; to help clarify the relations between neural nets and filters, the present section presents a broad classification of neural network architectures and functions, restricted to networks with supervised training.
2.2.1 Functions of Neural Networks. The functions of neural networks depend on the network architectures and on the nature of the input data:

- Network architectures: neural networks can have either a feedforward structure or a feedback structure;
- Input data: the succession of input data can be either time-ordered or arbitrarily ordered.
Feedback networks (also termed recurrent networks) have been used as associative memories, which store and retrieve either fixed points or trajectories in state space. The present paper stands in a completely different context: we investigate feedback neural networks that are never left to evolve under their own dynamics, but that are continually fed with new input data. In this context, the purpose of using neural networks is not that of storing and retrieving data, but that of capturing the (possibly nonstationary) characteristics of a signal or of a system.

Feedforward neural networks have been used basically as classifiers for patterns whose sequence of presentation is not significant and carries no information, although the ordering of components within an input vector may be significant. In contrast, the time ordering of the sequence of input data is of fundamental importance for filters: the input vectors can be, for instance, the sequence of values of a sampled signal. At time n, the network is presented with a window of the last M values of the sampled signal {u(n), u(n-1), ..., u(n-M+1)}, and, at time n+1, the input is shifted by one time period: {u(n+1), u(n), ..., u(n-M+2)}. In this context, feedforward networks are used as transversal filters, and feedback networks are used as recursive filters. A very large number of examples of feedforward networks for classification can be found in the literature. Neural network associative memories have also been very widely investigated (Hopfield 1982; Personnaz et al. 1986; Pineda 1987). Feedforward networks have been used for prediction (Lapedes and Farber 1988; Pearlmutter 1989; Weigend et al. 1990). Examples of feedback networks for filtering can be found in Robinson and Fallside (1989), Elman (1990), and Poddar and Unnikrishnan (1991).
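The shifting input window described above can be made explicit in a few lines (a sketch; the helper name and the convention of zero padding before n = 0 are our own):

```python
def sliding_windows(u, M):
    """Successive inputs of a feedforward net used as a transversal filter:
    at time n the input vector is [u(n), u(n-1), ..., u(n-M+1)];
    at time n+1 it is the same window shifted by one sampling period.
    Samples before n = 0 are taken as 0."""
    padded = [0.0] * (M - 1) + list(u)
    return [list(reversed(padded[n:n + M])) for n in range(len(u))]
```

For example, sliding_windows([1, 2, 3], 2) yields the successive input vectors [1, 0], [2, 1], [3, 2].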
Note that the above classification is not meant to be rigid. For instance, Chen et al. (1990) encode a typical filtering problem (channel equalization) into a classification problem. Conversely, Waibel et al. (1989) use a typical transversal filter structure as a classifier.
2.2.2 Nonadaptive and Adaptive Training. At present, in the vast majority of cases, neural networks are not used adaptively: they are first trained with a finite number of training samples, and subsequently used, for example, for classification purposes. Similarly, nonadaptive filters are first trained with a finite number of time-ordered samples, and subsequently used with fixed coefficients. In contrast, adaptive systems are trained continually while being used, with an infinite number of samples. The instances of neural networks being trained adaptively are quite few (Williams and Zipser 1989a,b; Williams and Peng 1990; Narendra and Parthasarathy 1990, 1991).

3 Structure of Neural Networks for Nonlinear Filtering

3.1 Model of Discrete-Time Neuron. The behavior of a discrete-time neuron is defined by relation 3:

z_i(n) = f_i[v_i(n)],   v_i(n) = Σ_{j∈P_i} Σ_{τ=0}^{q_ij} c_{ij,τ} z_j(n-τ)    (3)

where

- f_i is the activation function of neuron i
- v_i is the potential of neuron i
- z_j can be either the output of neuron j or the value of a network input j
- P_i is the set of indices of the afferent neurons and network inputs to neuron i
- c_{ij,τ} is the weight of the synapse that transfers information from neuron or network input j to neuron i with (discrete) delay τ
- q_ij is the maximal delay between neuron j and neuron i.

It should be clear that several synapses can transfer information from neuron (or network input) j to neuron i, each synapse having its own delay τ and its own weight c_{ij,τ}. Obviously, one must have c_{ii,0} = 0 for all i for causality to be preserved. If neuron i is such that i ∉ P_i and q_ij = 0 for all j ∈ P_i, neuron i is said to be static.
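Relation 3 can be sketched directly (a minimal illustration; the function signature and the zero treatment of pre-history samples are our own assumptions):

```python
def neuron_output(f, weights, z_hist, n):
    """Output of neuron i at time n (relation 3):
    v_i(n) = sum over afferents j and delays tau of c_{ij,tau} * z_j(n - tau),
    z_i(n) = f(v_i(n)).
    weights: dict mapping afferent index j to the list [c_{ij,0}, ..., c_{ij,q_ij}];
    z_hist:  dict mapping j to the past values z_j(0), ..., z_j(n).
    Values before time 0 are taken as 0."""
    v = 0.0
    for j, taps in weights.items():
        for tau, c in enumerate(taps):
            if n - tau >= 0:
                v += c * z_hist[j][n - tau]   # one synapse per delay tau
    return f(v)
```

For a single afferent with weights 0.5 (delay 0) and 0.25 (delay 1), and past values z(0) = 1, z(1) = 2, the potential at n = 1 is 0.5*2 + 0.25*1 = 1.25.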
Figure 1: General canonical form of a feedback neural network.
3.2 Structure of Neural Networks for Filtering. The architecture of a network, that is, the topology of the connections and the distribution of delays, may be fully or partially imposed by the problem that must be solved: the problem defines the sequence of input signal values and of desired outputs; in addition, a priori knowledge of the problem may give hints that help in designing an efficient architecture [see for instance the design of the feedforward network described in Waibel et al. (1989)]. To clarify the presentation and to make the implementation of the training algorithms easier, the canonical form of the network is especially convenient. We first introduce the canonical form of feedback networks; the canonical form of feedforward networks will appear as a special case.
3.2.1 The Canonical Form of Feedback Networks. The dynamics of a discrete-time feedback network can be described by a finite-difference equation of order N, which can be expressed by a set of N first-order difference equations involving N variables (termed state variables) in addition to the M input variables. Thus, any feedback network can be cast into a canonical form that consists of a feedforward (static) network whose outputs are the outputs of the neurons that have desired values, together with the values of the state variables, and whose inputs are the inputs of the network and the values of the state variables, the latter being delayed by one time unit (Fig. 1).
Note that the choice of the set of state variables is not necessarily unique: therefore, a feedback network may have several canonical forms. The state of the network is the set of values of the state variables. In the following, all vectors will be denoted by uppercase letters. The behavior of a single-input, single-output network is described by the state equation 4 and the output equation 4a:
S(n+1) = φ[S(n), U(n)]    (4)

y(n) = Ψ[S(n), U(n)]      (4a)
where U(n) is the vector of the M last successive values of the external input u and S(n) is the vector of the N state variables (state vector). The output of the network may be a state variable. The transformation of a noncanonical feedback neural network filter to its canonical form requires the determination of M and of N. In the single-input, single-output case, the computation of the maximum number of external inputs E (M ≤ E) is done as follows: construct the network graph, whose nodes are the neurons and the input, and whose edges are the connections between neurons, weighted by the values of the delays; find the direct path of maximum weight D from input to output; one has E = D + 1. The determination of the order N of the network from the network graph is less straightforward; it is described in Appendix 1. If the task to be performed does not suggest or impose any structure for the filter, one may use either a multilayer perceptron, or the most general form of feedforward network in the canonical form, that is, a fully connected network; the number of neurons, of state variables, and of delayed inputs must be found by trial and error. If we assume that the state variables are delayed values of the output, or if we assume that the state of the system can be reconstructed from values of the input and output, then all state variables have desired values. Such is the case for the NARMAX model (Chen and Billings 1989) and for the systems investigated in Narendra and Parthasarathy (1990). Figure 2 illustrates the most general version of the canonical form of a network having a single output y(n) and N state variables {y(n-1), ..., y(n-N)}. It features M external inputs, N feedback inputs, and one output; it can implement a fairly large class of functions Φ; the nonrecursive part of the network (which implements the function Φ) is a fully connected feedforward net.
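A direct simulation of the canonical form, state equation 4 and output equation 4a, takes only a few lines; here φ and Ψ are passed in as arbitrary functions, and zero initial conditions are assumed (our own conventions, for illustration only):

```python
def run_canonical(phi, psi, u_seq, M, N):
    """Simulate a canonical-form recursive filter:
        S(n+1) = phi(S(n), U(n))     (state equation 4)
        y(n)   = psi(S(n), U(n))     (output equation 4a)
    U(n) holds the M last inputs [u(n), ..., u(n-M+1)];
    S(n) holds the N state variables; both start at zero."""
    S = [0.0] * N
    U = [0.0] * M
    ys = []
    for u in u_seq:
        U = [u] + U[:-1]           # shift the input window
        ys.append(psi(S, U))       # output equation (4a)
        S = phi(S, U)              # state equation (4)
    return ys
```

For instance, the first-order linear recursive filter y(n) = u(n) + 0.5 y(n-1) corresponds to N = 1 with the state variable s = y(n-1): its impulse response is 1, 0.5, 0.25, ...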
More specific architectures are described in the literature, implementing various classes of functions φ and Ψ. Some examples of such architectures are presented in Appendix 2.

3.2.2 Special Case: The Canonical Form of Feedforward Networks. Similarly, any feedforward network with delays, with input signal u, can be cast into the form of a feedforward network of static neurons, whose inputs are the successive values u(n), u(n-1), ..., u(n-M+1); this puts
Figure 2: Canonical form of a network with a fully connected feedforward net, whose state variables are delayed values of the output.
the network under the form of a transversal filter obeying relation 1:

y(n) = Φ[u(n), u(n-1), ..., u(n-M+1)] = Φ[U(n)]
The transformation of a noncanonical feedforward neural network filter to its canonical form requires the determination of the maximum value M, which is done as explained above in the case of feedback networks. An example described in Appendix 1 shows that this transformation may introduce the replication of some weights, known as “shared weights.”
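The weight replication mentioned above can be illustrated on a toy network with one hidden neuron feeding the output through delays 0 and 1: unfolding it into its canonical (static) form duplicates the hidden neuron, and the input weight w1 is then shared between the two copies (the network and all names below are our own illustration, not taken from the paper's appendix):

```python
import math

def delayed_net(w1, w2, w3, u_seq):
    """Original network with a unit delay on the hidden neuron:
    h(n) = tanh(w1*u(n)),  y(n) = w2*h(n) + w3*h(n-1)."""
    h_prev, ys = 0.0, []
    for u in u_seq:
        h = math.tanh(w1 * u)
        ys.append(w2 * h + w3 * h_prev)
        h_prev = h
    return ys

def canonical_net(w1, w2, w3, u_seq):
    """Canonical static form: the hidden neuron is replicated, one copy
    per delayed input, so the weight w1 appears in both copies
    (a "shared weight")."""
    padded = [0.0] + list(u_seq)   # u(-1) taken as 0
    return [w2 * math.tanh(w1 * padded[n + 1]) + w3 * math.tanh(w1 * padded[n])
            for n in range(len(u_seq))]
```

Both forms compute identical output sequences; the canonical form merely trades the delayed connection for a replicated static neuron.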
4 Training Neural Networks for Adaptive Filtering

4.1 Criterion. The task to be performed by a neural network used as a filter is defined by a (possibly infinite) sequence of inputs u and of corresponding desired outputs d. At each sampling time n, an error e(n) is defined as the difference between the desired output d(n) and the actual output of the network y(n): e(n) = d(n) - y(n). For instance, in process identification, d(n) is the output of the process to be modeled; in a predictor, d(n) is the input signal at time n + 1. The training algorithms aim at finding the network coefficients so as to satisfy a given quality criterion. For example, in the case of nonadaptive
training (as defined in Section 2.2.2), the most popular criterion is the least squares (LS) criterion; the cost function to be minimized is

J(C) = (1/2) Σ_{p=1}^{K} e(p)^2

Thus, the coefficients minimizing J(C) are first computed with a finite number K of samples; the network is subsequently used with these fixed coefficients. In the context of adaptive training, taking into account all the errors since the beginning of the optimization does not make sense; thus, one can implement a forgetting mechanism. In the present paper, we use a rectangular "sliding window" of length N_c; hence the following cost function:

J(n, C) = (1/2) Σ_{p=n-N_c+1}^{n} e(p)^2
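For a linear transversal filter, this sliding-window cost and its gradient can be written out explicitly (a sketch under our own naming conventions; the coefficient update that uses this gradient is discussed in Section 4.2):

```python
def window_cost_and_gradient(c, u, d, n, Nc):
    """J(n) = 1/2 * sum_{p=n-Nc+1}^{n} e(p)^2 for the linear transversal
    filter y(p) = sum_k c[k]*u(p-k), with e(p) = d(p) - y(p).
    Returns (J(n), grad J(n)); samples before time 0 are taken as 0."""
    M = len(c)
    J, grad = 0.0, [0.0] * M
    for p in range(n - Nc + 1, n + 1):
        window = [u[p - k] if p - k >= 0 else 0.0 for k in range(M)]
        e = d[p] - sum(ck * uk for ck, uk in zip(c, window))
        J += 0.5 * e * e
        for k in range(M):
            grad[k] -= e * window[k]    # dJ/dc_k = -sum_p e(p) * u(p-k)
    return J, grad
```

A steepest-descent modification is then ΔC(n) = -μ ∇J(n), that is, c[k] -= mu * grad[k] for each coefficient.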
The choice of the length N_c of the window is task-dependent, and is related to the typical time scale of the nonstationarity of the signal to be processed. In the following, the notation J(n) will be used instead of J(n, C). The computation of e(p) will be discussed in Sections 4.3 and 4.4.2.

4.2 Adaptive Training Algorithms. Adaptive algorithms compute, in real time, coefficient modifications based on past information. In the present paper, we consider only gradient-based algorithms, which require the estimation of the gradient of the cost function, ∇J(n), and possibly the estimation of J(n); these computations make use of data available at time n. In the simplest and most popular formulation, a single modification of the vector of coefficients ΔC(n) = C(n) - C(n-1) is computed between time n and time n + 1; such a method, usual in adaptive filtering, is termed a purely recursive algorithm. The modification of the coefficients is often performed by the steepest-descent method, whereby ΔC(n) = -μ∇J(n). To improve upon the steepest-descent method, quasi-Newton methods can be used (Press et al. 1986), whereby ΔC(n) = +μD(n), where D(n) is a vector obtained by a linear transformation of the gradient. Purely recursive algorithms were introduced in order to avoid time-consuming computations between the reception of two successive samples of the input signal. If the application under investigation does not have stringent time requirements, then other possibilities can be considered. For instance, if it is desired to get closer to the minimum of the cost function, several iterations of the gradient algorithm can be performed between time n and time n + 1. In that case, the coefficient-modification
vector ΔC(n) is computed iteratively as ΔC(n) = C_{K_n}(n) - C_0(n), where K_n is the number of iterations at time n, with

C_k(n) = C_{k-1}(n) + μ_k D_{k-1}(n)    (k = 1 to K_n)

where D_{k-1}(n) is obtained from the coefficients computed at iteration k - 1, and C_0(n) = C_{K_{n-1}}(n - 1). If K_n > 1, the tracking capabilities of the system in the nonstationary case, or the speed of convergence to a minimum in the stationary case, may be improved with respect to the purely recursive algorithm. The applicability of this method depends specifically on the ratio of the typical time scale of the nonstationarity to the sampling period. As a final variant, it may be possible to update the coefficients with a period T > 1 if the time scale of the nonstationarity is large with respect to the sampling period:

C_0(n) = C_{K_{n-T}}(n - T)

Whichever algorithm is chosen, the central problem is the estimation of the gradient, ∇J(n).
At present, two techniques are available for this computation: the forward computation of the gradient, and the popular backpropagation of the gradient.

1. The forward computation of the gradient is based on a recursive relation for the partial derivatives of the output at time n with respect to the coefficients: these partial derivatives are computed recursively in the forward direction, from the partial derivatives of the inputs to the partial derivatives of the outputs of the network.

2. In contrast, backpropagation uses the chain rule to compute the gradient of J(n). The required partial derivatives of the cost function J(n) with respect to the potentials are computed in the backward direction, from the output to the inputs.

The advantages and disadvantages of these two techniques will be discussed in Sections 4.3 and 4.4.2.
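The two techniques can be contrasted on a toy one-hidden-neuron network y = w2*tanh(w1*u): the forward technique carries d(output)/d(coefficient) along with the activations, while backpropagation carries the output sensitivity backward through the potentials; both give identical gradients (a minimal example of our own, not from the paper):

```python
import math

def gradients_forward(w1, w2, u):
    """Forward computation: propagate the partial derivatives with
    respect to each coefficient from the input toward the output."""
    h = math.tanh(w1 * u)
    dh_dw1 = (1.0 - h * h) * u        # derivative of tanh(w1*u) w.r.t. w1
    y = w2 * h
    return y, (w2 * dh_dw1, h)        # (dy/dw1, dy/dw2)

def gradients_backward(w1, w2, u):
    """Backpropagation: chain rule applied from the output back toward
    the inputs, reusing the forward-pass activations."""
    h = math.tanh(w1 * u)
    y = w2 * h
    delta_h = w2                      # dy/dh, backpropagated to the hidden unit
    return y, (delta_h * (1.0 - h * h) * u, h)
```

On this tiny network the two computations are equally cheap; the cost difference between the two directions appears only for larger networks, as discussed in Section 4.3.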
In the following, we show how to compute the coefficient modifications for feedforward and feedback neural networks, and we put into perspective the training algorithms developed recently for neural networks and the algorithms used classically in adaptive filtering.

4.3 Training Feedforward Neural Networks for Nonlinear Transversal Adaptive Filtering. We consider purely recursive algorithms (i.e., T = 1 and K_n = 1); the extension to non-purely recursive algorithms is straightforward. As shown in Section 3.2.2, any discrete-time feedforward neural network can be cast into a canonical form in which all neurons are static. The output of such a network is computed from the M past values of the input, and the output at time n does not depend on the values of the output at previous times. Therefore, the cost function J(n) is a sum of N_c independent terms. Its gradient can be computed, from the N_c + M + 1 past input data and the N_c corresponding desired outputs, as a sum of N_c independent terms: therefore, the modification of the coefficients at time n is the sum of N_c elementary modifications computed from N_c independent, identical elementary blocks [each of them with coefficients C(n-1)], between time n and time n + 1. We introduce the following notation, which will be used both for feedforward and for feedback networks: the blocks are numbered by m; all values computed from block m of the training network will be denoted with superscript m. For instance, y^m(n) is the output value of the network computed by the mth block at time n: it is the value that the output of the filter would have taken on, at time n - N_c + m, if the vector of coefficients of the network at that time had been equal to C(n-1). With this notation, the cost function taken into account for the modification of the coefficients at time n becomes

J(n) = (1/2) Σ_{m=1}^{N_c} e^m(n)^2

where e^m(n) = d(n - N_c + m) - y^m(n) is the error for block m computed at time n. As mentioned in Section 4.2, two techniques are available for computing the gradient of the cost function: the forward computation technique (used classically in adaptive filtering) and the backpropagation technique (used classically for neural networks) (Rumelhart et al. 1986). Thus, each
block, from block m = 1 to block m = N_c, computes a partial modification ΔC^m(n) of the coefficients, and the total modification at time n is

ΔC(n) = Σ_{m=1}^{N_c} ΔC^m(n)
as illustrated in Figure 3. It was mentioned above that either the forward computation method or the backpropagation method can be used for the estimation of the gradient of the cost function. Both techniques lead to exactly the same numerical results; it has been shown (Pineda 1989) that backpropagation is less computationally expensive than forward computation. Therefore, for the training of feedforward networks operating as nonlinear transversal filters, backpropagation is the preferred technique for gradient estimation. However, as we shall see in the following, this is not always the case for the training of feedback networks.

4.4 Training Feedback Neural Networks for Nonlinear Recursive Adaptive Filtering. This section is devoted to the adaptive training of feedback networks operating as recursive filters. This problem is definitely richer, and more difficult, than the training of feedforward networks for adaptive transversal filtering. We present a wide variety of algorithms, and elucidate their relationships to adaptation algorithms used in linear adaptive filtering and to neural network training algorithms.
4.4.1 General Presentation of the Algorithms for Training Feedback Networks. Since the state variables and the output of the network at time n depend on the values of the state variables of the network at time n - 1, the computation of the gradient of the cost function requires the computation of partial derivatives from time n = 0 up to the present time n. This is clearly not practical, since (1) the amount of computation would grow without bound, and (2) in the case of nonstationary signals, taking into account the whole past history does not make sense. Therefore, the estimation of the gradient of the cost function is performed by truncating the computations to a fixed number of sampling periods N_t into the past. Thus, one has to use N_t computational blocks (defined below), numbered from m = 1 to m = N_t: the outputs y^m(n) are computed through N_t identical versions of the feedforward part of the canonical form of the network [each of them with coefficients C(n-1)]. Clearly, N_t must be larger than or equal to N_c in order to compute the N_c last errors e^m(n). Here again, we first consider the case where T = 1 and K_n = 1. Figure 4 shows the mth computational block for the forward computation technique: the state input vector is denoted by S_in^m(n); the state output vector is denoted by S_out^m(n). The canonical feedforward (FF) net computes the output from the external inputs U^m(n) and the state inputs S_in^m(n).

Figure 3: Computation of two successive coefficient modifications for a nonlinear transversal filter (N_c = 3).

Figure 4: Training block m at time n with a desired output value: computation of a partial coefficient modification using the forward computation of the gradient for a feedback neural network. If the output of block m has no desired value, it has no "products" part and does not contribute directly to coefficient modifications: it just transmits the state variables and their derivatives to the next block.

The forward computation (FC) net computes the partial derivatives required for the coefficient modification, and the partial derivatives of the state vector, which may be used by the next block. The N_t blocks compute sequentially the N_t outputs {y^m} and the partial derivatives {∂y^m/∂c_ij}, in the forward direction (m = 1 to N_t). The N_c errors {e^m} (computed from the outputs of the last N_c blocks) and the corresponding partial derivatives are used for the computation of the coefficient modification, which is the sum of N_c terms:

ΔC(n) = Σ_{m=N_t-N_c+1}^{N_t} ΔC^m(n)
Details of the computations are to be found in Appendix 3. For the blocks to be able to perform the above computations, the values of the state inputs S_in^m(n) and of their partial derivatives with respect to the weights must be determined. The choice of these values is of central importance; it gives rise to four families of algorithms.
4.4.2 Choice of the State Inputs and of Their Partial Derivatives. Choice of the state inputs: The most "natural" choice of the state inputs of block m is to take the values of the state variables computed by block m - 1: S_in^m(n) = S_out^{m-1}(n), with S_in^1(n) = S_out^1(n - 1). Thus, the trajectory of the network in state space, computed at time n, is independent of the trajectory of the process: the input of block m is not directly related to the actual values of the state variables of the process to be modeled by the network, hence the name undirected algorithm. If the coefficients are mismatched, this choice may lead to large errors and to instabilities. Figure 5a shows pictorially the desired trajectory of the state of the network and the trajectory that is computed at time n when an undirected algorithm is used (N_t = 3, N_c = 2). We show in the next section that in that case, one must use the forward computation technique to compute the coefficient modifications (Fig. 5b). This choice of the state inputs has been known as the output error approach in adaptive filtering and as the parallel approach in automatic control. It does not require that all state variables have desired values. In order to reduce the risk of instabilities, an alternative approach may be used, called a semidirected algorithm. In this approach, the state of the network is constrained to be identical to the desired state for m = 1:
S_in^1(n) = [d(n - N_t), d(n - N_t - 1), ..., d(n - N_t - N + 1)]

and S_in^m(n) = S_out^{m-1}(n) for m > 1. This is possible only when the chosen model is such that desired values are available for all state variables; this is the case for the NARMAX model. Figure 6a shows pictorially the desired trajectory of the state of the network and the trajectory that is computed at time n when a semidirected algorithm is used (N_t = 4, N_c = 2). We show in the next section that in that case, one can use the backpropagation technique to compute the coefficient modifications (Fig. 6b). The trajectory of the state of the network can be further constrained by choosing the state inputs of all blocks to be equal to their desired values:
S_in^m(n) = [d(n - N_t + m - 1), d(n - N_t + m - 2), ..., d(n - N_t + m - N)]

for m = 1 to N_t. With this choice, the training is under the control of the desired values, hence of the process to be modeled, at each step of the computations necessary for the adaptation (hence the name directed algorithm); therefore, it can be expected that the influence of the mismatch of the model to the process is less severe than in the previous cases. Figure 7a shows pictorially the desired trajectory of the state of the network and the trajectory that is computed at time n when a directed algorithm is used (N_t = N_c = 3). We show in the next section that in that case, one can use the backpropagation technique to compute the coefficient modifications (Fig. 7b). In directed algorithms, all blocks are independent, just as in the case of the training of feedforward networks (Section 4.3); therefore, one has N_t = N_c.
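The three choices of state inputs can be summarized in one hypothetical helper (the function name and argument conventions are our own; the S_out values would come from the previously computed blocks):

```python
def state_input(choice, m, s_out_prev_block, s_out_first_prev_time, d, n, Nt, N):
    """State input S_in^m(n) of training block m under the three families:
    'undirected':   S_in^1(n) = S_out^1(n-1);   S_in^m(n) = S_out^{m-1}(n)
    'semidirected': S_in^1(n) = desired values; S_in^m(n) = S_out^{m-1}(n)
    'directed':     S_in^m(n) = [d(n-Nt+m-1), ..., d(n-Nt+m-N)] for every m"""
    if choice == "directed" or (choice == "semidirected" and m == 1):
        return [d[n - Nt + m - 1 - k] for k in range(N)]
    if m == 1:                               # undirected, first block
        return list(s_out_first_prev_time)   # state computed at time n-1
    return list(s_out_prev_block)            # state computed by block m-1 at time n
```

With d(p) = p, n = 10, N_t = 4, and N = 2, the directed state input of block 1 is [d(6), d(5)] and that of the last block is [d(9), d(8)], tracking the desired trajectory at every block.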
Figure 5: Undirected algorithm (with N_t = 3 and N_c = 2). (a) Pictorial representation of the desired trajectory, and of the trajectory computed at time n, in state space; the trajectory at time n is computed by the blocks shown in b. (b) Computational system at time n. The detail of each block is shown in Figure 4. Note that the output of block 1 has no desired value.
This choice of the values of the state inputs has been known as the equation error approach in adaptive filtering and as the series-parallel approach in automatic control. It is an extension of the teacher forcing technique (Jordan 1985) used for neural network training. If some state inputs do not have desired values, hybrid versions of
Figure 6: Semidirected algorithm (with N_t = 4 and N_c = 2). (a) Pictorial representation of the desired trajectory, and of the trajectory computed at time n, in state space; the trajectory at time n is computed by the blocks shown in b. (b) Computational system at time n. The detail of each block is shown in Figure 8. Note that the outputs of blocks 1 and 2 have no desired values, but do contribute an additive term to the coefficient modifications.

the above algorithms can be used: those state inputs for which no desired values are available are taken equal to the corresponding computed state variables (as in an undirected algorithm), whereas the other state inputs may be taken equal to their desired values (as in a directed or in a semidirected algorithm).

Consistent choices of the partial derivatives of the state inputs: The choices of the state inputs lead to corresponding choices for the initialization
Figure 7: Directed algorithm (with N_t = N_c = 3). (a) Pictorial representation of the desired trajectory, and of the trajectory computed at time n, in state space; the trajectory at time n is computed by the blocks shown in b. (b) Computational system at time n. The detail of each block is shown in Figure 8. Note that in a directed algorithm, each block is independent of the others and must have a desired output value.
of the partial derivatives, as illustrated in Figures 5a, 6a, and 7a. In the case of the undirected algorithm, one has S_in^m(n) = S_out^{m-1}(n); therefore, a consistent choice of the values of the partial derivatives of the state inputs consists in taking the values of the partial derivatives of the state outputs computed by the previous block:

∂S_in^m(n)/∂c_ij = ∂S_out^{m-1}(n)/∂c_ij
except for the first block, where one has

∂S_in^1(n)/∂c_ij = ∂S_out^1(n - 1)/∂c_ij

In the case of the semidirected algorithm, the state input values of the first block are taken equal to the corresponding desired values; the latter do not depend on the coefficients; therefore, their partial derivatives can consistently be taken equal to zero. The values of the partial derivatives of the state inputs of the other blocks are taken equal to the values of the partial derivatives of the state outputs computed by the previous block. In the case of the directed algorithm, one can consistently take the partial derivatives of the state inputs of all blocks equal to zero. The parameters T, K_n, N_t, N_c being fixed, the first three algorithms described above are summarized on the first line of each section of Table 1. The first part of each acronym refers to the choice of the state inputs and the second part refers to the choice of the partial derivatives of the state inputs. They include algorithms that have been used previously by other authors: the real-time recurrent learning algorithm (Williams and Zipser 1989a) is an undirected algorithm (using the forward computation technique) with N_t = N_c = 1. This algorithm is known as the recursive prediction error algorithm, or IIR-LMS algorithm, in linear adaptive filtering (Widrow and Stearns 1985). The teacher-forced real-time recurrent learning algorithm (Williams and Zipser 1989a) is a hybrid algorithm with N_t = N_c = 1. The above algorithms have been introduced in the framework of the forward computation of the gradient of the cost function. However, the estimation of the gradient of the cost function by backpropagation is attractive with respect to computation time, as mentioned in Section 4.3. If this technique is used, the computation is performed with N_t blocks, where each coefficient c_ij is replicated in each block m as c_ij^m. Therefore, one has

∂J(n)/∂c_ij = Σ_{m=1}^{N_t} ∂J(n)/∂c_ij^m
The training block m is shown in Figure 8: after computing the N_c errors using the N_t blocks in the forward direction, the N_t blocks compute the derivatives of J(n) with respect to the potentials {v_i^m}, in the backward direction. The modification of the coefficients is computed from the N_t blocks as

Δc_ij(n) = -μ Σ_{m=1}^{N_t} ∂J(n)/∂c_ij^m
It is important to notice that backpropagation assumes implicitly that the partial derivatives of the state inputs of the first copy are taken equal to zero. Therefore, the backpropagation technique will lead to the same coefficient modifications as the forward propagation technique if and only if it is used
Table 1: Three Families of Algorithms for the Training of Feedback Neural Networks*

Undirected (UD) algorithm (output error) (parallel):
state input of the first block: S_out(n − 1); state input of a current block: the state output of the previous block.
  UD algorithm: partial derivatives for the first block: ∂S_out/∂c_ij(n − 1); for a current block: those of the state outputs of the previous block.
  UD-SD algorithm: partial derivatives for the first block: zero; for a current block: those of the previous block.
  UD-D algorithm: partial derivatives of the state inputs of all blocks: zero.

Semidirected (SD) algorithm:
state input of the first block: desired values; state input of a current block: the state output of the previous block.
  SD algorithm: partial derivatives for the first block: zero; for a current block: those of the previous block.
  SD-D algorithm: partial derivatives of the state inputs of all blocks: zero.
  SD-UD algorithm: partial derivatives for the first block: ∂S_out/∂c_ij(n − 1); for a current block: those of the previous block.

Directed (D) algorithm (equation error) (teacher forcing) (series-parallel):
state inputs of all blocks: desired values.
  D algorithm: partial derivatives of the state inputs of all blocks: zero.
  D-SD algorithm: partial derivatives for the first block: zero; for a current block: those of the previous block.
  D-UD algorithm: partial derivatives for the first block: ∂S_out/∂c_ij(n − 1); for a current block: those of the previous block.

*In each section, the first line describes the algorithm with consistent choices of the state inputs.
Neural Networks and Adaptive Filtering
Figure 8: Training block m at time n with a desired output value: computation of a partial coefficient modification using the backpropagation technique for the estimation of the gradient for a feedback neural network. If block m has no desired value, then e^m = 0, but it does contribute an additive term to the coefficient modification. It should be noticed that forward propagation through all blocks must be performed before backpropagation.
O. Nerrand et al.

within algorithms complying with this condition, that is, within directed or semidirected algorithms (Figs. 6b and 7b); backpropagation cannot be used consistently within undirected and hybrid algorithms. When both backpropagation and forward computation techniques can be used, backpropagation is the best choice because of its lower computational complexity. An example of the use of a directed algorithm for identification and control of nonlinear processes can be found in Narendra and Parthasarathy (1990).

Other choices of the partial derivatives of the state inputs: because adaptive neural networks require real-time operation, tradeoffs between consistency and computation time may be necessary: setting the partial derivatives ∂S_in^1/∂c_ij equal to zero may save time by making the computation by backpropagation possible even for undirected algorithms (UD-D or UD-SD algorithms). The full variety of algorithms is shown in Table 1: in each group, the first line shows the characteristics of the fully consistent algorithm, whereas the other two lines show other possibilities which are not fully consistent, but which can nevertheless be used with advantage. The SD-UD, D-SD, and D-UD algorithms have been included for completeness: computation time permitting, the accuracy of the computation may be improved by setting the partial derivatives of the state inputs to nonzero values in the directed or semidirected case. Undirected algorithms have been in use in linear adaptive filtering: the extended LMS algorithm is a UD-D algorithm (see Table 1) with N_t = N_c = 1 (Shynk 1989); the a posteriori error algorithm is also a UD-D algorithm with N_t = 2, N_c = 1 (Shynk 1989). The truncated backpropagation through time algorithm (Williams and Peng 1990) is a UD-D algorithm with N_c = 1 and N_t > 1, with a special feature: to save computation time, the coefficients of the blocks 1 to N_t − 1 are the coefficients that were computed at the corresponding times.

5 Conclusion
The present paper provides a comprehensive framework for the adaptive training of neural networks, viewed as nonlinear filters, either transversal or recursive. We have introduced the concept of the canonical form of a neural network, which provides a unifying view of network architectures and allows a general description of training methods based on gradient estimation. We have shown that backpropagation is always advantageous for training feedforward networks adaptively, but that it is not necessarily the best method for training feedback networks. In the latter case, four families of training algorithms have been proposed; some of these algorithms have been in use in classical linear adaptive filtering or adaptive control, whereas others are original. The unifying concepts thus introduced are helpful in bridging the gap between neural networks and adaptive filters. Furthermore, they raise a number of challenging problems, both for basic and for applied research. From a fundamental point of view, general approaches to the convergence and stability of these algorithms are still lacking; a preliminary study along these lines has been presented (Dreyfus et al. 1992); from the point of view of applications, the real-time operation of nonlinear adaptive systems requires specific silicon implementations, thereby raising the questions of the speed and accuracy required for the computations.
Appendix 1

We consider a discrete-time neural network with an arbitrary structure, and its associated network graph as defined in Section 3.2. The set of state variables is the minimal set of variables that must be initialized to allow the computation of the state of all neurons at any
time n > 0, given the values of the external inputs at all times from 0 to n. The order of the network is the number of state variables. Clearly, the only neurons whose state must be initialized are the neurons that are within loops (i.e., within cycles in the network graph). Therefore, to determine the order of the network, the network graph should be pruned by suppressing all external inputs and all edges that are not within cycles (this may result in a disconnected graph). To determine the order, it is convenient to further simplify the network graph as follows: (1) merge parallel edges into a single edge whose delay is the maximum delay of the parallel edges; (2) if two edges of a loop are separated by a neuron that belongs to this loop only, suppress the neuron and merge the edges into a single edge whose delay is the sum of the delays of the edges. We now consider the neurons that are still represented by nodes in the simplified network graph. We denote by N the order of the network. If, for each node i of the simplified graph, we denote by Δ_i the delay of the synapse afferent to neuron i that has the largest delay (i.e., the largest weight of the edges directed toward i), then a simple upper bound for N is given by

N ≤ Σ_i Δ_i
The state x_i of a neuron i that has an afferent synapse of delay Δ_i cannot be computed at times n < Δ_i; the computation of the states of the other neurons may require the values of x_i at times 0, 1, ..., Δ_i − 1; thus, the contribution of neuron i to the order of the network is smaller than or equal to Δ_i. Let the quantity w_i be defined as

w_i = Δ_i − min_{j∈R_i}(Δ_j − τ_{j,i})   if Δ_i − min_{j∈R_i}(Δ_j − τ_{j,i}) > 0,
w_i = 0   otherwise,

where R_i stands for the set of indices of the nodes j that are linked to i by an edge directed from i to j (i.e., the set of neurons to which neuron i projects efferent synapses). Then the order of the network is given by

N = Σ_i w_i
The necessity of imposing the state of neuron i at time k (0 ≤ k ≤ Δ_i − 1) depends on whether this value is necessary for the computation of the state of a neuron j to which neuron i sends its state: if k + τ_{j,i} is smaller than the maximum delay Δ_j of the synapses afferent to j, it is not necessary to transmit the state of neuron i at time k to neuron j, since the latter does not have the information required to compute its state at time k + τ_{j,i}; the information on the state of neuron i at time k is necessary only if one has k ≥ Δ_j − τ_{j,i}. Therefore, the minimum number of successive values required for neuron i is equal to

Δ_i − min_{j∈R_i}(Δ_j − τ_{j,i})   if Δ_i − min_{j∈R_i}(Δ_j − τ_{j,i}) > 0,   zero otherwise
Clearly, this result is in accord with the upper bound given above. The above results determine the number of state variables related to each neuron. The choice of the set of state variables is not unique. The presence of parallel edges within a loop, or the presence of feedforward connections between loops, may require the replication of some neurons and of some coefficients. Figure A1.1a shows a feedback network and Figure A1.1b shows its canonical form; the order of the network is 6. The example shows that some weights are replicated.
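The counting rule above can be sketched as follows. This is a minimal illustration, not code from the paper: it assumes the graph has already been pruned and simplified as described, and the dictionary encoding of the edge delays is invented for the example.

```python
def network_order(delays):
    """Order N of a pruned, simplified network graph.

    delays[(i, j)] = tau_ji, the delay of the edge from neuron i to
    neuron j.  Delta_j is the largest delay afferent to node j; node i
    contributes w_i = max(0, Delta_i - min_j(Delta_j - tau_ji)) state
    variables, the minimum being taken over its efferent targets j.
    """
    nodes = {i for edge in delays for i in edge}
    Delta = {j: max((t for (_, j2), t in delays.items() if j2 == j),
                    default=0)
             for j in nodes}
    order = 0
    for i in nodes:
        efferent = [(j, t) for (i2, j), t in delays.items() if i2 == i]
        if not efferent:
            continue
        w = Delta[i] - min(Delta[j] - t for j, t in efferent)
        order += max(0, w)
    return order

# A single self-loop of delay 2 requires two initial values.
print(network_order({(1, 1): 2}))  # prints 2
```

For a two-neuron loop with unit delays on both edges, the rule gives order 2, in accord with the fact that both neuron states must be initialized.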
Appendix 2

This appendix describes several architectures of feedback neural networks that have been proposed in the literature. We present their canonical form, so that they can be easily compared. The discrete-time mathematical model of a time-invariant dynamical process is of the form

S(n + 1) = φ[S(n), U(n)]
Y(n) = Ψ[S(n), U(n)]

where vector U is the input of the dynamical system, vector S denotes the state of the system, and vector Y is the output of the system. Since neural networks with hidden neurons are able to approximate a large class of nonlinear functions, they can be used for implementing the functions φ and Ψ. The network proposed by Jordan (1986) is trained to produce a given sequence y(n) for a given constant input P ("plan"). Thus it is used as an associative memory. The network and its canonical form are shown in Figure A2.1. The representation of the network under its canonical form shows that the network is of order 2, although the representation used by Jordan exhibits four connections with unit delays. Note that the state variables are not delayed values of the output. The presence of hidden neurons allows this network to learn any function y(n) = Ψ[S(n), U(n)]. The network suggested by Elman (1990) is used as a nonlinear filter. Its canonical form is shown in Figure A2.2.
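One time step of such a state-space model can be sketched as follows. This is an illustrative sketch, not the paper's code: it uses the Elman-style restriction φ = f(A S + B U) with a linear output map, and the matrices, sizes, and choice f = tanh are assumptions made for the example.

```python
import math

def canonical_step(A, B, C, S, u):
    """One step of a canonical-form network:
    S(n+1) = f(A S(n) + B u(n)), y(n) = C S(n), with f = tanh.
    A, B, C are matrices given as lists of rows."""
    def matvec(M, x):
        return [sum(m * xi for m, xi in zip(row, x)) for row in M]
    v = [p + q for p, q in zip(matvec(A, S), matvec(B, u))]
    return [math.tanh(vi) for vi in v], matvec(C, S)

# Scalar example: one state variable, one input, one output.
S_next, y = canonical_step([[0.5]], [[1.0]], [[2.0]], [0.1], [0.3])
```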
Figure A1.1: (a) Example of a feedback neural network. Numbers in rectangles are synapse delay values, u is the external input, and y is the output of the network. (b) Canonical form of the network (E = 8, M = 2, N = 6). The c_ij notation of relation 3 is used.
Figure A2.1: (a) Network architecture proposed by Jordan. (b) Canonical form.
Figure A2.2: Canonical form of the network architecture proposed by Elman.

Each state variable is computed as a fixed nonlinear function f of a weighted sum of the external inputs and state inputs. Therefore, the class of functions φ that can be implemented is restricted to the form

φ[S(n), U(n)] = f[A S(n) + B U(n)]

where A and B are the synaptic matrices. Similarly, the output is computed as a fixed nonlinear function f of a weighted sum of the state
variables, so that the class of functions Ψ that can be implemented is restricted to

Ψ[S(n), U(n)] = f[C S(n)]

where C is the synaptic matrix. The network proposed in Williams and Zipser (1989a) and Williams and Peng (1990) is used as a nonlinear filter. The state of the network at time n + 1 is computed as a weighted sum of the inputs and of the state values at time n, followed by a fixed nonlinearity f. As a result, the network can only implement nonlinear functions of the form f[A S(n) + B U(n)]. The network used by Poddar and Unnikrishnan (1991) consists of a "feedforward" network of pairs of neurons; each neuron, except the output neuron, and each external input is associated to a "memory neuron." If x_i(n) is the value of the output of neuron i and x_j(n) the value of the output of the associated memory neuron j at time n, the output of the memory neuron at time n + 1 is

x_j(n + 1) = α_i x_i(n) + (1 − α_i) x_j(n),   0 < α_i ≤ 1
If α_i = 1, the memory neurons introduce only delays, so that the network is a nonlinear transversal filter. If α_i ≠ 1, the memory neurons are linear low-pass first-order filters, and the network is actually a feedback network. A state output is associated to each memory neuron. Figure A2.3a shows an example of such an architecture, where neurons 3, 4, 7, and 8 are the memory neurons associated to the two inputs 1 and 2 and to the two neurons 5 and 6, respectively. The canonical form is shown in Figure A2.3b, where x_3, x_4, x_7, x_8 are chosen as state variables. For process identification and control problems, the most general structure used by Narendra and Parthasarathy (1991) is a model of the specific form

y(n) = Ψ_1[u(n − 1), u(n − 2), ...] + Ψ_2[y(n − 1), y(n − 2), ...]

where Ψ_1 and Ψ_2 are implemented by MLP networks with 20 neurons in the first hidden layer and 10 neurons in the second hidden layer.

Appendix 3
For simplicity, we present the training of the fully connected neural net of Figure 2. We denote the external inputs by z_1 to z_M, the feedback inputs by z_{M+1} to z_{M+N}, and the outputs of the neurons by z_{M+N+1} to z_{M+N+ν} (where ν is the number of neurons). The neurons are ordered in the following way: the pth neuron receives the outputs of neurons indexed q < p (fully connected).
Figure A2.3: (a) Example of a network with memory neurons. (b) Canonical form of the network.
At time n, we have to consider the following cost function:

J(n) = (1/2) Σ_{m=N_t−N_c+1}^{N_t} (e^m)²
where N_t is the number of blocks used to compute the N_c values e^m (N_t ≥ N_c). In this appendix, we present the contribution of block m (1 ≤ m ≤ N_t) to the gradient estimation. This contribution is computed from the external input vector, the desired value, and the state input vector. We denote the available values of the coefficients at time n by {c_ij}. The canonical FF net of the mth block, with coefficients {c_ij^m} = {c_ij}, computes the outputs z_i^m = f_i(v_i^m) of all neurons and the state output vector S_out^m(n) from the external input vector

U^m(n) = [u(n − N_t + m), u(n − N_t + m − 1), ..., u(n − N_t + m − M + 1)] = [z_1^m, z_2^m, ..., z_M^m]

and the state input vector

S_in^m(n) = [z_{M+1}^m, z_{M+2}^m, ..., z_{M+N}^m]

as follows:
1. For i = 1 to M (external inputs):

z_i^m = u(n − N_t + m − i + 1)

2. For i = M + 1 to M + N (state inputs):

z_i^m is given by the chosen algorithm (Table 1)

3. For i = M + N + 1 to M + N + ν − 1 (hidden neurons):

v_i^m = Σ_{j<i} c_ij z_j^m,   z_i^m = f_i(v_i^m)

4. For i = M + N + ν (linear output neuron):

v_i^m = Σ_{j<i} c_ij z_j^m,   z_i^m = v_i^m = y^m
Thus, the state output vector is

S_out^m(n) = [z_{M+N+ν}^m, z_{M+N+ν+1}^m, ..., z_{M+N+ν+N−1}^m] = [y^m, z_{M+1}^m, ..., z_{M+N−1}^m]

and, if N_t − N_c + 1 ≤ m ≤ N_t, we obtain from the desired value d(n − N_t + m) and the output y^m: e^m = d(n − N_t + m) − y^m.
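The block computation in steps 1-4 above can be sketched as follows. This is an illustrative reading of the appendix, not the paper's code: it assumes tanh hidden neurons and a linear output neuron, and the dictionary encoding of the coefficients is invented for the example.

```python
import math

def block_forward(c, u_window, s_in, M, N, nu):
    """Forward pass of one training block (steps 1-4).

    z[1..M]: external inputs, z[M+1..M+N]: state inputs,
    z[M+N+1..M+N+nu]: neuron outputs; neuron i sees all z[j], j < i.
    c[(i, j)] is the coefficient from z[j] to neuron i.
    Returns the output y and the state outputs [y, z_{M+1}, ...].
    """
    z = [0.0] * (M + N + nu + 1)          # index 0 unused (1-based)
    z[1:M + 1] = u_window                 # step 1: external inputs
    z[M + 1:M + N + 1] = s_in             # step 2: state inputs
    for i in range(M + N + 1, M + N + nu + 1):
        v = sum(c.get((i, j), 0.0) * z[j] for j in range(1, i))
        # step 3: hidden neurons (here tanh); step 4: linear output
        z[i] = v if i == M + N + nu else math.tanh(v)
    y = z[M + N + nu]
    return y, [y] + s_in[:N - 1]

# Tiny example: M = 1 external input, N = 1 state input, nu = 2 neurons.
coeffs = {(3, 1): 1.0, (3, 2): 0.5, (4, 3): 2.0}
y, s_out = block_forward(coeffs, [0.5], [0.2], 1, 1, 2)
```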
In the following, we present two methods for the computation of the gradient of J(n): the forward computation and the backpropagation techniques.

1. Forward computation (Fig. 4): We consider the whole set of N_t blocks as a static network on which we perform the forward computation technique. It is based on the following relation:

∂S_in^m/∂c_ij (n) = ∂S_out^{m−1}/∂c_ij (n)
The linear FC net of the mth block computes, with coefficients {c_ij^m} and {f_i'(v_i^m)}, the set of partial derivatives of the state outputs (including y^m) with respect to all coefficients c_ij: ∂S_out^m/∂c_ij (n). For the ν(M + N) + (ν − 1)ν/2 coefficients c_ij (i > j):

1. For p = 1 to M (external inputs):

∂z_p^m/∂c_ij = 0

2. For p = M + 1 to M + N (feedback inputs):

∂z_p^m/∂c_ij is given by the chosen algorithm (Table 1)

3. For p = M + N + 1 to M + N + ν − 1 (hidden neurons):

∂z_p^m/∂c_ij = f_p'(v_p^m) [Σ_{q<p} c_pq ∂z_q^m/∂c_ij + δ_{pi} z_j^m]   (δ_{pi} = 1 if p = i, 0 otherwise)

4. For p = M + N + ν (linear output neuron):

∂z_p^m/∂c_ij = Σ_{q<p} c_pq ∂z_q^m/∂c_ij + δ_{pi} z_j^m

Thus, the partial derivatives of the state outputs are given by

∂S_out^m/∂c_ij (n) = [∂y^m/∂c_ij, ∂z_{M+1}^m/∂c_ij, ..., ∂z_{M+N−1}^m/∂c_ij]
Once all partial derivatives of the output values y^m are computed for the N_t blocks, the gradient of J(n) is obtained from

∂J(n)/∂c_ij = − Σ_{m=N_t−N_c+1}^{N_t} e^m (∂y^m/∂c_ij)(n)

If the steepest-descent method is used, the coefficient modifications are given by

Δc_ij(n) = −μ ∂J(n)/∂c_ij = μ Σ_{m=N_t−N_c+1}^{N_t} e^m (∂y^m/∂c_ij)(n) = Σ_{m=N_t−N_c+1}^{N_t} Δc_ij^m(n)
2. Backpropagation (Figure 8): Considering the effect of the coefficient c_ij only, one has

J(n) = J[c_ij^1, c_ij^2, ..., c_ij^{N_t}]   with c_ij^1 = c_ij^2 = ... = c_ij^{N_t} = c_ij

thus

∂J(n)/∂c_ij = Σ_{m=1}^{N_t} ∂J(n)/∂c_ij^m

Then the gradient of J(n) can be written as

∂J(n)/∂c_ij = Σ_{m=1}^{N_t} (∂J(n)/∂v_i^m)(∂v_i^m/∂c_ij^m) = − Σ_{m=1}^{N_t} q_i^m z_j^m

where

q_i^m = −∂J(n)/∂v_i^m

This means that standard backpropagation can be applied to the whole set of N_t blocks considered as a static network with replicated coefficients. The linear BP net of the mth block computes, with coefficients {c_ij^m} and {f_i'(v_i^m)}, the set of partial derivatives of J(n) with respect to the potentials of all neurons. We define the following set of variables q_i^m:

1. For i = M + N + ν + N − 1 down to M + N + ν + 1:

if m = N_t then q_i^m = 0, otherwise q_i^m = q_{i−(N+ν−1)}^{m+1}
2. For i = M + N + ν (linear output neuron):

if m = N_t then q_i^m = e^m, otherwise q_i^m = e^m + q_{M+1}^{m+1}; note that q_i^m = −∂J(n)/∂v_i^m.

3. For i = M + N + ν − 1 down to M + N + 1 (hidden neurons):

q_i^m = f_i'(v_i^m) Σ_{h∈R_i} c_hi^m q_h^m

where R_i is the set of indices of the neurons to which the ith neuron transmits its output.

4. For i = M + N (last feedback input):

q_i^m = Σ_{h∈R_i} c_hi^m q_h^m

5. For i = M + N − 1 down to M + 1 (other feedback inputs):

q_i^m = Σ_{h∈R_i} c_hi^m q_h^m + q_{i+N+ν}^m
Note that computation by backpropagation implicitly assumes that the derivatives of the feedback inputs of the first block (m = 1) with respect to the coefficients are equal to zero; this is in contrast to the forward computation of the gradient, where these values can be initialized arbitrarily. Note also that with the forward computation technique, the number of partial derivatives to compute for each block is ν[ν(M + N) + (ν − 1)ν/2], whereas with the backpropagation method this number is ν. Once all partial derivatives of J(n) with respect to the potentials v_i^m of all neurons are computed for the N_t blocks, the gradient of J(n) is obtained from

∂J(n)/∂c_ij = − Σ_{m=1}^{N_t} q_i^m z_j^m

If the steepest-descent method is used, the coefficient modifications are given by

Δc_ij(n) = μ Σ_{m=1}^{N_t} q_i^m z_j^m
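The distinction between the directed and undirected choices of the state inputs can be sketched in a toy setting. This is not code from the paper: the scalar model y(n) = tanh(a y(n−1) + b u(n)), the teacher parameters, and all names are invented for illustration; it uses N_t = N_c = 1, so the undirected case reduces to an RTRL-style carry-forward of the partial derivatives.

```python
import math
import random

def train(mode, steps=3000, mu=0.05, seed=0):
    """Adapt y(n) = tanh(a*y(n-1) + b*u(n)) with N_t = N_c = 1.

    'directed': state input = desired value d(n-1); the partial
    derivative of the state input is taken equal to zero.
    'undirected': state input = previous output y(n-1); its partial
    derivatives are carried forward (RTRL-style).
    Returns the mean squared error over the last 500 steps.
    """
    rng = random.Random(seed)
    a_true, b_true = 0.5, 1.0         # hypothetical teacher process
    a = b = 0.0                       # coefficients being adapted
    y_prev = d_prev = 0.0
    dya = dyb = 0.0                   # running partials dy/da, dy/db
    err2 = 0.0
    for n in range(steps):
        u = rng.uniform(-1.0, 1.0)
        d = math.tanh(a_true * d_prev + b_true * u)   # desired output
        s = d_prev if mode == "directed" else y_prev  # state input
        y = math.tanh(a * s + b * u)
        fp = 1.0 - y * y                              # tanh derivative
        if mode == "directed":
            ga, gb = fp * s, fp * u   # state-input derivatives = zero
        else:
            ga = fp * (s + a * dya)   # carry partials around the loop
            gb = fp * (u + a * dyb)
            dya, dyb = ga, gb
        e = d - y
        a += mu * e * ga              # steepest descent on (1/2)e^2
        b += mu * e * gb
        if n >= steps - 500:
            err2 += e * e
        y_prev, d_prev = y, d
    return err2 / 500.0

print(train("directed"), train("undirected"))
```

Both choices drive the residual error toward zero here because the model class contains the teacher; they differ in how the feedback path enters the gradient.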
Acknowledgments
The authors are very grateful to O. Macchi for numerous discussions that have been very helpful in putting neural networks into the perspective of adaptive filtering. C. Vignat has been instrumental in formalizing some computational aspects of this work. We thank H. Gutowitz for his critical reading of the manuscript. This work was supported in part by EEC Contract ST2J0312C.
References

Applebaum, S. P., and Chapman, D. J. 1976. Adaptive arrays with main beam constraints. IEEE Trans. Antennas Propagation AP-24, 650-662.
Bellanger, M. G. 1987. Adaptive Digital Filters and Signal Analysis. Marcel Dekker, New York.
Chen, S., and Billings, S. A. 1989. Representations of non-linear systems: The NARMAX model. Int. J. Control 49, 1013-1032.
Chen, S., Gibson, G. J., Cowan, C. F. N., and Grant, P. M. 1990. Adaptive equalization of finite nonlinear channels using multilayer perceptrons. Signal Process. 20, 107-119.
Dreyfus, G., Macchi, O., Marcos, S., Personnaz, L., Roussel-Ragot, P., Urbani, D., and Vignat, C. 1992. Adaptive training of feedback neural networks for non-linear filtering and control. In Neural Networks for Signal Processing II, S. Y. Kung, F. Fallside, J. Aa. Sorenson, and C. A. Kamm, eds., pp. 550-559. IEEE.
Elman, J. L. 1990. Finding structure in time. Cog. Sci. 14, 179-211.
Fallside, F. 1990. Analysis of linear predictive data as speech and of ARMA processes by a class of single-layer connectionist models. In Neurocomputing: Algorithms, Architectures and Applications, F. Fogelman-Soulie and J. Herault, eds., pp. 265-283. Springer.
Haykin, S. 1991. Adaptive Filter Theory. Prentice-Hall International Editions, Englewood Cliffs, NJ.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366.
Jayant, N. S., and Noll, P. 1984. Digital Coding of Waveforms: Principles and Applications to Speech and Video. Signal Processing Series, A. Oppenheim, ed. Prentice-Hall, Englewood Cliffs, NJ.
Jordan, M. I. 1985. The Learning of Representations for Sequential Performance. Doctoral Dissertation, University of California, San Diego.
Jordan, M. I. 1986. Serial order: A parallel, distributed processing approach. Proc. Eighth Annu. Conf. Cog. Sci. Soc., 531-546.
Lapedes, A., and Farber, R. 1988. How neural nets work. In Neural Information
Processing Systems, D. Z. Anderson, ed., pp. 442-456. American Institute of Physics.
McCannon, T. E., Gallagher, N. C., Minoo-Hamedani, D., and Wise, G. L. 1982. On the design of nonlinear discrete-time predictors. IEEE Trans. Inform. Theory 28, 366-371.
Narendra, K. S., and Parthasarathy, K. 1990. Identification and control of dynamical systems using neural networks. IEEE Trans. Neural Networks 1, 4-27.
Narendra, K. S., and Parthasarathy, K. 1991. Gradient methods for the optimization of dynamical systems containing neural networks. IEEE Trans. Neural Networks 2, 252-262.
Nicolau, E., and Zaharia, D. 1989. Adaptive arrays. In Studies in Electrical and Electronic Engineering 35. Elsevier, Amsterdam.
Pearlmutter, B. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1, 263-269.
Personnaz, L., Guyon, I., and Dreyfus, G. 1986. Collective computational properties of neural networks: New learning mechanisms. Phys. Rev. A 34, 4217-4228.
Picinbono, B. 1988. Adaptive methods in temporal processing. In Underwater Acoustic Data Processing, Y. T. Chan, ed., pp. 313-327. Kluwer Academic Publishers, Dordrecht.
Pineda, F. 1987. Generalization of backpropagation to recurrent neural networks. Phys. Rev. Lett. 59, 2229-2232.
Pineda, F. J. 1989. Recurrent backpropagation and the dynamical approach to adaptive neural computation. Neural Comp. 1, 161-172.
Poddar, P., and Unnikrishnan, K. P. 1991. Non-linear prediction of speech signals using memory neuron networks. In Neural Networks for Signal Processing: Proceedings of the 1991 IEEE Workshop, B. H. Juang, S. Y. Kung, and C. A. Kamm, eds., pp. 395-404.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1986. Numerical Recipes. Cambridge University Press, Cambridge.
Proakis, J. G. 1983. Digital Communications. McGraw-Hill, New York.
Robinson, A. J., and Fallside, F. 1989. A dynamic connectionist model for phoneme recognition. In Neural Networks from Models to Applications, L. Personnaz and G. Dreyfus, eds., pp. 541-550. IDSET, Paris.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, D. Rumelhart and J. McClelland, eds. MIT Press, Cambridge.
Shynk, J. J. 1989. Adaptive IIR filtering. IEEE ASSP Mag., April, 4-21.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. 1989. Phoneme recognition using time-delay neural networks. IEEE Trans. Acoustics, Speech, Signal Process. 37, 328-339.
Weigend, A. S., Huberman, B. A., and Rumelhart, D. E. 1990. Predicting the future: A connectionist approach. Int. J. Neural Syst. 1, 193-209.
Widrow, B., and Stearns, S. D. 1985. Adaptive Signal Processing. Prentice-Hall, Englewood Cliffs, NJ.
Williams, R. J., and Zipser, D. 1989a. A learning algorithm for continually
running fully recurrent neural networks. Neural Comp. 1, 270-280.
Williams, R. J., and Zipser, D. 1989b. Experimental analysis of the real-time recurrent learning algorithm. Connect. Sci. 1, 87-111.
Williams, R. J., and Peng, J. 1990. An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Comp. 2, 490-501.

Received 30 January 1992; accepted 21 August 1992.
NOTE
Communicated by William W. Lytton
Fast Calculation of Synaptic Conductances

Rajagopal Srinivasan
Department of Electrical Engineering, Case Western Reserve University, Cleveland, OH 44106 USA

Hillel J. Chiel
Departments of Biology and Neuroscience, Case Western Reserve University, Cleveland, OH 44106 USA

Synaptic conductances are often modeled as sums of α functions

g(t) = Σ_{i=1}^{k} [(t − t_i)/τ] e^{−(t−t_i)/τ}   (1)
where t is the current time, t_i is the time of the ith spike in the presynaptic neuron, and τ is the time constant of the synapse. If the time of decay of the synapse, τ_D, is not equal to its time of onset, τ_O, the conductance at time t after k spikes have occurred is

g(t) = [τ_D τ_O/(τ_D − τ_O)] Σ_{i=1}^{k} [e^{−(t−t_i)/τ_D} − e^{−(t−t_i)/τ_O}]   (2)
The drawback of these solutions is that one must keep track of the times of occurrence of each spike that initiated the synaptic potentials, and recalculate each exponential in the summation at each time step. This creates a large storage and computational overhead. Since both these equations represent the impulse response of a second-order differential equation, another approach is to numerically integrate additional differential equations for each synapse in the network (Wilson and Bower 1989). We have developed an improved method for computing synaptic conductances that separates equations 1 and 2 into two components: one that is a function of the current time of the simulation and one that accumulates the contributions of previous spike events to the synaptic conductance. We demonstrate that this method requires only the storage of two running sums and the time constants for each synapse, and that it is mathematically equivalent to equations 1 and 2. We will then demonstrate that it is also faster for a given level of precision than numerically integrating differential equations for each synapse. We will first describe our algorithm for equation 1, and then for equation 2.

Neural Computation 5, 200-204 (1993) © 1993 Massachusetts Institute of Technology
Equation 1 can be rewritten as follows:

g(t) = (e^{−t/τ}/τ) [ t Σ_{i=1}^{k} e^{t_i/τ} − Σ_{i=1}^{k} t_i e^{t_i/τ} ]   (3)

When the k + 1st spike occurs at time t_{k+1}, single terms can be added to each of the two summations in brackets to update them, eliminating the need to store spike times. To keep the exponentials inside the summation and outside the brackets from growing too large or small (respectively) as t increases over time, the exponentials can be rescaled. The left-hand term can be rewritten as

(t/τ) e^{−(t−t_k)/τ} Σ_{i=1}^{k} e^{−(t_k−t_i)/τ}   (4)

We will refer to the terms within the summation as Sum1(t_k). It can be updated once the k + 1st spike occurs as follows:

Sum1(t_{k+1}) = e^{−(t_{k+1}−t_k)/τ} Sum1(t_k) + 1   (5)

because after the k + 1st spike occurs,

Σ_{i=1}^{k+1} e^{−(t_{k+1}−t_i)/τ} = e^{−(t_{k+1}−t_k)/τ} Σ_{i=1}^{k} e^{−(t_k−t_i)/τ} + 1   (6)

The right-hand term in equation 3 can also be rewritten as

(1/τ) e^{−(t−t_k)/τ} Σ_{i=1}^{k} t_i e^{−(t_k−t_i)/τ}   (7)

Once the k + 1st spike occurs, the new form of the terms within the summation [which we refer to as Sum2(t_{k+1})] would be

Σ_{i=1}^{k+1} t_i e^{−(t_{k+1}−t_i)/τ} = e^{−(t_{k+1}−t_k)/τ} Σ_{i=1}^{k} t_i e^{−(t_k−t_i)/τ} + t_{k+1}   (8)

so that Sum2(t_k) is updated using the following rule:

Sum2(t_{k+1}) = e^{−(t_{k+1}−t_k)/τ} Sum2(t_k) + t_{k+1}   (9)

Thus, at time t > t_{k+1}, from equations 3-9, the synaptic conductance is equal to

(e^{−(t−t_{k+1})/τ}/τ) [ t Sum1(t_{k+1}) − Sum2(t_{k+1}) ]   (10)
The conductance needs to be evaluated at each time step of the simulation, that is, from time t to time t + Δt. This can be accomplished by multiplying the term at time t, e^{−(t−t_{k+1})/τ}, by e^{−Δt/τ}, which yields the term at time t + Δt, e^{−[(t+Δt)−t_{k+1}]/τ}. Thus, the conductances can be updated as follows:

Sum1(t + Δt) = e^{−Δt/τ} Sum1(t) + S   (11)
Sum2(t + Δt) = e^{−Δt/τ} Sum2(t) + S t   (12)

where S = 1 if a spike occurred at time t and S = 0 otherwise. The synaptic conductance can then be calculated at time t + Δt from

g = (1/τ) [(t + Δt) Sum1(t + Δt) − Sum2(t + Δt)]   (13)
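The update rules 11-13 can be sketched as follows. This is an illustrative implementation, not the authors' code: the function names are invented, and a direct evaluation of equation 1 is included only for comparison.

```python
import math

def alpha_direct(t, spikes, tau):
    """Sum of alpha functions, equation 1, evaluated directly."""
    return sum(((t - ti) / tau) * math.exp(-(t - ti) / tau)
               for ti in spikes if ti <= t)

def alpha_running(spike_times, t_end, dt, tau):
    """Equations 11-13: two running sums, updated once per time step."""
    spikes = set(spike_times)
    decay = math.exp(-dt / tau)
    sum1 = sum2 = 0.0
    g = []
    for i in range(int(round(t_end / dt))):
        t = i * dt
        s = 1.0 if t in spikes else 0.0
        sum1 = decay * sum1 + s                     # equation 11
        sum2 = decay * sum2 + s * t                 # equation 12
        g.append(((t + dt) * sum1 - sum2) / tau)    # equation 13
    return g

# Hypothetical spike train; only two sums are stored per synapse.
g = alpha_running([2.0, 4.0, 10.0], t_end=30.0, dt=0.5, tau=5.0)
```

Note that the running form stores only Sum1, Sum2, and τ, however many spikes have occurred, whereas `alpha_direct` must re-sum every stored spike at every step.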
Equations 11, 12, and 13 summarize our algorithm for updating equation 1, which requires storage of only Sum1, Sum2, and τ. Since these equations are mathematically identical to equation 1, the accuracy of this method does not depend on the step size Δt, unless the step size becomes so large that spikes are missed. Of course, this constraint on step size is true for equation 1 as well. By the same logic, equation 2 can be updated as follows:

Sum3(t + Δt) = e^{−Δt/τ_D} Sum3(t) + S   (14)
Sum4(t + Δt) = e^{−Δt/τ_O} Sum4(t) + S   (15)

where S = 1 if a spike occurred at time t and S = 0 otherwise, Sum3(t) = Σ_{i=1}^{k} e^{−(t−t_i)/τ_D}, and Sum4(t) = Σ_{i=1}^{k} e^{−(t−t_i)/τ_O}. The synaptic conductance can then be calculated at time t + Δt from

g = [τ_D τ_O/(τ_D − τ_O)] [Sum3(t + Δt) − Sum4(t + Δt)]   (16)

Determining the value of the conductance requires only that Sum3, Sum4, τ_O, and τ_D be saved for each synapse. In addition, τ_D τ_O/(τ_D − τ_O) is a constant for a given synapse, and therefore can be precalculated for each synapse. This method requires far fewer exponentiations and additions than does the original closed-form solution (compare equations 1 and 2 to equations 11-16). Furthermore, the accuracy of our method is limited only by the machine precision. It also requires far less memory storage to maintain this accuracy. The number of spikes that must be stored to maintain the precision of equations 1 or 2 depends on (1) the spike frequency, (2) the synaptic time constant, and (3) the required precision level, ε. To ensure that a spike that has occurred at some time t_0 in the past will add less than ε to equation 1, it must be true that [(t − t_0)/τ] e^{−(t−t_0)/τ} < ε. It can be shown that, for sufficiently small ε, setting (t − t_0)/τ equal to

P(ε) = ln(1/ε) + [1 + 1/ln(1/ε)] [ln(ln(1/ε))]   (17)
Fast Calculation of Synaptic Conductances
will always satisfy this constraint. For example, if ε = 10^-6, the value of P(ε) would be 16.63, which implies that t - t_0 must be equal to 16.63τ; that is, spike t_0 must be stored until a time of 16.63τ has elapsed, after which its contribution to equation 1 will be less than ε. If the time constant τ of this synapse is 50 msec, this requires that a spike be stored for 831.5 msec. The worst-case size of the storage queue for a synapse would then be determined by this storage time, divided by the minimum period between spikes (which determines the maximum number of spikes that may occur during this storage time). If the input cell spikes with a minimum period of 100 msec between spikes, the queue would have to have room to store 831.5/100 ≈ 9 spike times. In general,

Max queue size = P(ε)τ/T_min     (18)

where T_min is the minimum firing period of the input cell. Thus, storage requirements for equation 1 or 2 increase logarithmically with increasing precision [since, as ε decreases, the fastest growing term in equation 17 is ln(1/ε)], increase linearly with the synaptic time constant τ, and increase inversely with the minimum firing period T_min. If one chooses to implement a queue dynamically, one has the computational overhead of keeping track of which spikes have aged sufficiently to be dropped. Whether one uses a fixed-size array or a dynamic array, one must use more storage as the precision, input firing frequency, or time constant of a synapse increases.

How does our method compare to numerically integrating a second-order differential equation, injecting new impulses each time an action potential occurs in the presynaptic neuron? A variety of techniques exist for numerically integrating differential equations (Press et al. 1988). One efficient, stable, and fairly accurate technique that is frequently employed is referred to as the exponential technique.
For the second-order differential equation yielding equation 1 or 2, applying this technique yields the following finite difference equations (Wilson and Bower 1989, p. 328):
where x(t) is nonzero at the time a spike occurs, and zero otherwise. A drawback of this approach is that it is not inherently as precise as our method. One must trade off speed versus precision for equations 19-20. For example, if we choose to use a 1 msec time step for our method, we found that this numerical integration technique must be run with a step size 4 times smaller to keep its deviations from our method acceptably small; to obtain deviations smaller than 10^-6 requires a step size 8 times smaller, and obtaining still smaller deviations requires a step size 16 times smaller. Of course, we could choose a larger step size for our method without loss of accuracy (see above), and the step sizes
Rajagopal Srinivasan and Hillel J. Chiel
for the numerical integration technique would then be proportionally smaller. We directly compared the time taken by the three methods by writing three benchmark programs in C (code listings available from the authors on request), and timing them on a Decstation 5000/200. As a reasonable precision limit for the methods, we chose 10^-6. For equation 1 and equations 11-13, a 1 msec step size was used. For equations 19-20, a 1 msec step size gave relatively poor precision; a step size of 0.125 msec was necessary to limit deviations from the other two methods to less than 10^-6. The time constant for the synapse was 50 msec, the input spike frequency was 10 Hz, and we chose a queue size for the method of equation 1 that would maintain spike times until they had decayed to values smaller than 10^-6 (from equation 18, we determined that the queue should hold 9 spike times). Simulation time was 200 sec. Using these parameters, our method (equations 11-13) required only 2.1 sec of real time, whereas the method of equation 1 required 15.8 sec of real time, and the method of equations 19-20 required 18.1 sec of real time. These results suggest that our method is superior both in terms of speed and accuracy to previous methods.

Acknowledgments
R. S. was supported by a Research Experience for Undergraduates Supplement to H. J. C.'s NSF Grant, BNS 88-10757. H. J. C. thanks the NSF for its support of this research. We are grateful for the comments of Dr. Randall Beer and two anonymous reviewers on an earlier draft of this manuscript.

References

Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. (eds.) 1988. Numerical Recipes in C. Cambridge University Press, Cambridge.
Wilson, M. A., and Bower, J. M. 1989. The simulation of large-scale neural networks. In Methods in Neuronal Modeling, C. Koch and I. Segev, eds., pp. 291-333. MIT Press, Cambridge.

Received 11 February 1992; accepted 3 August 1992.
Communicated by David Willshaw
NOTE
The Variance of Covariance Rules for Associative Matrix Memories and Reinforcement Learning

Peter Dayan
Terrence J. Sejnowski
Computational Neurobiology Laboratory, The Salk Institute, P.O. Box 85800, San Diego, CA 92186-5800 USA

Hebbian synapses lie at the heart of most associative matrix memories (Kohonen 1987; Hinton and Anderson 1981) and are also biologically plausible (Brown et al. 1990; Baudry and Davis 1991). Their analytical and computational tractability make these memories the best understood form of distributed information storage. A variety of Hebbian algorithms for estimating the covariance between input and output patterns has been proposed. This note points out that one class of these involves stochastic estimation of the covariance, shows that the signal-to-noise ratios of the rules are governed by the variances of their estimates, and considers some parallels in reinforcement learning.

Associations are to be stored between Ω pairs [a(ω), b(ω)] of patterns, where a(ω) ∈ {0,1}^m and b(ω) ∈ {0,1}^n, using the real-valued elements of an m × n matrix W. Elements of a(ω) and b(ω) are set independently with probabilities p and r, respectively, of being 1. A learning rule specifies how element W_ij changes in response to the input and output values of a particular pair; the model adopted here (from Palm 1988a,b) considers local rules with additive weight changes for which:

W_ij = Σ_(ω=1)^Ω Δ_ij(ω),  where Δ_ij(ω) = f[a_i(ω), b_j(ω)]

and f can be represented as [α, β, γ, δ] = [f(0,0), f(0,1), f(1,0), f(1,1)], the four values taken on the possible combinations of input and output activity.
One way to measure the quality of a rule is the signal-to-noise ratio (S/N) of the output of a single "line" or element of the matrix, which is a measure of how well outputs that should be 0 can be discriminated from outputs that should be 1. The larger the S/N, the better the memory will perform (see Willshaw and Dayan 1990 for a discussion).

Neural Computation 5, 205-209 (1993)  © 1993 Massachusetts Institute of Technology

A wide variety of Hebbian learning rules has been proposed for hetero- and autoassociative networks (Kohonen 1987; Sejnowski 1977a; Hopfield 1982; Perez-Vincente and Amit 1989; Tsodyks and Feigel'man 1988). The covariance learning rule f_cov = [pr, -p(1-r), -(1-p)r, (1-p)(1-r)] has the highest S/N (Willshaw and Dayan 1990; Dayan and Willshaw 1991); however, both it and a related rule, f_prd = [-pr, -pr, -pr, 1-pr] (Sejnowski 1977a), have the drawback that α ≠ 0, that is, a weight should change even if both input and output are silent. Note the motivations behind these rules:

f_cov ~ (input - p) × (output - r)
f_prd ~ input × output - pr
Alternative rules have been suggested that better model the physiological phenomena of long-term potentiation (LTP) and depression (LTD) in the visual cortex and hippocampus, including the heterosynaptic rule f_het = [0, -p, 0, 1-p] (Stent 1973; Rauschecker and Singer 1979), and the homosynaptic rule f_hom = [0, 0, -r, 1-r] (Sejnowski 1977b; Stanton and Sejnowski 1989), motivated as

f_het ~ (input - p) × output
f_hom ~ input × (output - r)
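The claim that all four rules estimate the same covariance, differing only in variance, can be checked by exact enumeration over the four joint states of a binary input and output. A C sketch, with type and function names ours rather than from the note:

```c
#include <math.h>

/* A local rule f = [alpha, beta, gamma, delta] evaluated at the four
   joint states (a,b) = (0,0), (0,1), (1,0), (1,1). */
typedef struct { double f00, f01, f10, f11; } Rule;

/* Exact mean of f for independent a (P[a=1]=p) and b (P[b=1]=r). */
double rule_mean(Rule f, double p, double r)
{
    return (1-p)*(1-r)*f.f00 + (1-p)*r*f.f01
         + p*(1-r)*f.f10 + p*r*f.f11;
}

/* Exact variance of f under the same distribution. */
double rule_var(Rule f, double p, double r)
{
    double m = rule_mean(f, p, r);
    return (1-p)*(1-r)*(f.f00-m)*(f.f00-m) + (1-p)*r*(f.f01-m)*(f.f01-m)
         + p*(1-r)*(f.f10-m)*(f.f10-m) + p*r*(f.f11-m)*(f.f11-m);
}
```

For any p and r, the means of f_cov, f_prd, f_het, and f_hom all vanish under independence, while the variances come out as p(1-p)r(1-r), pr(1-pr), pr(1-p), and pr(1-r), respectively, with f_cov always the smallest and f_prd always the largest.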
These have been shown to have lower S/Ns than the covariance rule (Willshaw and Dayan 1990); however, for sparse patterns, that is, low values of p and r, this difference becomes small. The sparse limit is interesting theoretically, because many more patterns can be stored, and empirically, because the cortex has been thought to employ it (see, for example, Abeles et al. 1990). All of these rules are effectively stochastic approximations of the covariance between input and output, ⟨(a_i(ω) - ā_i)(b_j(ω) - b̄_j)⟩_ω, where the averages ⟨·⟩_ω are taken over the distributions generating the patterns; they all share this as their common mean. If inputs and outputs are independent, as is typically the case for heteroassociative memories, or autoassociative ones without the identity terms, then their common expected value is zero. However, the rules differ in their variances as estimates of the covariance. Since it is departures of this quantity from its expected value that mark the particular patterns the matrix has learned, one would expect that the lower the variance of the estimate, the better the rule. This turns out to be true: for independent inputs and outputs, the S/Ns of the rules
are inversely proportional to their variances (for f_cov, the variance is p(1-p)r(1-r)). f_cov is the best, f_prd the worst, but the ordering of the other two depends on p and r. Circumstances arise under which the optimal rule differs, as for instance if patterns are presented multiple times but input lines can fail to fire on particular occasions; this would favor the homosynaptic rule.

Exactly the same effect underlies the differences in efficacy between various comparison rules for reinforcement learning. Sutton (1984) studied a variety of two-armed bandit problems, which are conventional tasks for stochastic learning automata. On trial ω, a system emits action y(ω) ∈ {0,1} (i.e., pulls either the left or the right arm) and receives a probabilistic reward r(ω) ∈ {0,1} from its environment, where

P[r(ω) = 1 | y(ω) = i] = p_i
In the supervised learning case above, the goal was to calculate the covariance between the input and output. Here, however, the agent has to measure the covariance between its output and the reward in order to work out which action it is best to emit (i.e., which arm it is best to pull). Sutton evaluated r(ω)[y(ω) - ⟨y(ω)⟩] and an approximation to [r(ω) - ⟨r(ω)⟩][y(ω) - ⟨y(ω)⟩], where ⟨y(ω)⟩ averages over the stochastic process generating the outputs and

⟨r(ω)⟩ = P[y(ω) = 0]p_0 + P[y(ω) = 1]p_1
is the expected reinforcement given the stochastic choice of y(ω). These are direct analogues of f_het or f_hom [depending on whether y(ω) is mapped to a(ω) or b(ω)] and f_cov, respectively, and Sutton showed the latter significantly outperformed the former.

There is, however, an even better estimator. In the previous case, a and b were independent; here, by contrast, r(ω) is a stochastic function of y(ω). The learning rule that minimizes the variance of the estimate of the covariance is actually

[r(ω) - r̃][y(ω) - ⟨y(ω)⟩]

where r̃ = P[y(ω) = 0]p_1 + P[y(ω) = 1]p_0 pairs the probability of emitting action 0 with the reinforcement for emitting action 1. Williams (personal communication) suggested r̃ on just these grounds, and simulations (Dayan 1991) confirm that it does indeed afford an improvement.

Four previously suggested Hebbian learning rules have been shown to be variants of stochastic covariance estimators. The differences between their performances, in terms of the signal-to-noise ratio they produce in an associative matrix memory, may be attributed to the differences in the variances of their estimates of the covariance. The same effect underlies the performance of reinforcement comparison learning rules, albeit suggesting a different optimum.
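The advantage of r̃ over ⟨r(ω)⟩ as a baseline can likewise be checked by exact enumeration over the four joint states of (y, r). A C sketch, with all names ours and an arbitrary illustrative parameter setting:

```c
#include <math.h>

/* Mean and variance of the estimator (r - baseline)(y - <y>) for a
   two-armed bandit with P[y=0] = q0 and P[r=1 | y=i] = p_i. */
typedef struct { double mean, var; } Stats;

Stats estimator_stats(double q0, double p0, double p1, double baseline)
{
    double q1 = 1.0 - q0, ybar = q1;   /* <y> */
    double prob[4] = { q0*(1-p0), q0*p0, q1*(1-p1), q1*p1 };
    double yv[4]   = { 0, 0, 1, 1 };
    double rv[4]   = { 0, 1, 0, 1 };
    Stats s = { 0.0, 0.0 };
    for (int i = 0; i < 4; i++)
        s.mean += prob[i] * (rv[i] - baseline) * (yv[i] - ybar);
    for (int i = 0; i < 4; i++) {
        double d = (rv[i] - baseline) * (yv[i] - ybar) - s.mean;
        s.var += prob[i] * d * d;
    }
    return s;
}
```

Any constant baseline leaves the mean equal to Cov(r, y); choosing r̃ = P[y=0]p_1 + P[y=1]p_0 yields a strictly smaller variance than ⟨r⟩ whenever the two baselines differ.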
Acknowledgments
We are very grateful to Steve Nowlan and David Willshaw for helpful comments. Support was from the SERC and the Howard Hughes Medical Institute.
References

Abeles, M., Vaadia, E., and Bergman, H. 1990. Firing patterns of single units in the prefrontal cortex and neural network models. Network 1, 13-25.
Anderson, J. A., and Rosenfeld, E., eds. 1988. Neurocomputing: Foundations of Research. MIT Press, Cambridge, MA.
Baudry, M., and Davis, J. L. 1991. Long-Term Potentiation: A Debate of Current Issues. MIT Press, Cambridge, MA.
Brown, T. H., Kairiss, E. W., and Keenan, C. L. 1990. Hebbian synapses: Biophysical mechanisms and algorithms. Annu. Rev. Neurosci. 13, 475-512.
Dayan, P. 1991. Reinforcement comparison. In Connectionist Models: Proceedings of the 1990 Summer School, D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, eds. Morgan Kaufmann, San Mateo, CA.
Dayan, P., and Willshaw, D. J. 1991. Optimal synaptic learning rules in linear associative memories. Biol. Cybernet. 65, 253-265.
Hinton, G. E., and Anderson, J. A., eds. 1981. Parallel Models of Associative Memory. Lawrence Erlbaum, Hillsdale, NJ.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Kohonen, T. 1987. Content-Addressable Memories, 2nd ed. Springer-Verlag, Berlin.
Palm, G. 1988a. On the asymptotic information storage capacity of neural networks. In Neural Computers, NATO ASI Series, R. Eckmiller and C. von der Malsburg, eds., Vol. F41, 271-280. Springer-Verlag, Berlin.
Palm, G. 1988b. Local synaptic rules with maximal information storage capacity. In Neural and Synergetic Computers, Springer Series in Synergetics, H. Haken, ed., Vol. 42, 100-110. Springer-Verlag, Berlin.
Perez-Vincente, C. J., and Amit, D. J. 1989. Optimised network for sparsely coded patterns. J. Phys. A: Math. General 22, 559-569.
Rauschecker, J. P., and Singer, W. 1979. Changes in the circuitry of the kitten's visual cortex are gated by postsynaptic activity. Nature (London) 280, 58-60.
Sejnowski, T. J. 1977a. Storing covariance with nonlinearly interacting neurons. J. Math. Biol. 4, 303-321.
Sejnowski, T. J. 1977b. Statistical constraints on synaptic plasticity. J. Theoret. Biol. 69, 385-389.
Stanton, P., and Sejnowski, T. J. 1989. Associative long-term depression in the hippocampus: Induction of synaptic plasticity by Hebbian covariance. Nature (London) 339, 215-218.
Stent, G. S. 1973. A physiological mechanism for Hebb's postulate of learning. Proc. Natl. Acad. Sci. 70, 997-1001.
Sutton, R. S. 1984. Temporal Credit Assignment in Reinforcement Learning. Ph.D. Thesis, University of Massachusetts, Amherst, MA.
Tsodyks, M. V., and Feigel'man, M. V. 1988. The enhanced storage capacity in neural networks with low activity level. Europhys. Lett. 6, 101-105.
Willshaw, D. J., and Dayan, P. 1990. Optimal plasticity in matrix memories: What goes up MUST come down. Neural Comp. 2, 85-93.

Received 20 January 1992; accepted 14 September 1992.
Communicated by Geoffrey Hinton
NOTE
Optimal Network Construction by Minimum Description Length

Gary D. Kendall
Trevor J. Hall
Department of Physics, King's College London, Strand, London WC2R 2LS, UK
1 Introduction
It has been established that the generalization ability of an artificial neural network is strongly dependent on the number of hidden processing elements and weights (Baum and Haussler 1989). There have been several attempts to determine the optimal size of a neural network as part of the learning process. These typically alter the number of hidden nodes and/or connection weightings in a multilayer perceptron either by heuristic methods (Le Cun et al. 1990; Fahlman and Lebiere 1990) or inherently via some network size penalty (Chauvin 1989; Weigend et al. 1991; Nowlan and Hinton 1992). In this note an objective method for network optimization is proposed that eliminates the need for a network size penalty parameter.

2 Network Complexity
Rissanen has proposed a formalism of Ockham's razor with the minimum description length (MDL) principle (Rissanen 1989). It is asserted that the smallest representation of an observed training set is the statistical process by which it was generated. Hence the most probable model Θ of any system S given a set of observed data is the model that minimizes the total description length L:

L = L(S | Θ) + L(Θ)

In a perfect coding system the description length of a set of symbols approaches the sum of the self information of the symbols. It is therefore straightforward to closely approximate the total description length of a neural network system by summing the self information of the parameters of the network itself and the input/output values of each training pair unaccounted for by the network:

L = I(S | Θ) + I(Θ)

Neural Computation 5, 210-212 (1993)  © 1993 Massachusetts Institute of Technology
To calculate the self information I associated with each of these parameters, the sum of the self information I(E_k) of the constituent bytes is taken:

I = Σ_k I(E_k),  where I(E_k) = -log_2 P(E_k)

Here the E_k are the individual events of each byte in each parameter, and P(E_k) is the a priori probability of E_k given the distribution of bytes in the corresponding set of parameters (E_k ∈ {0, ..., 255}). Network optimization using this method can take place for any parameterized neural network model with supervised learning. In minimizing the total description length, both network learning and network construction take place simultaneously. An approach that can be used to optimize an adaptive network architecture by minimum description length is to include additional parameters indicating the status of each node/weight (i.e., used or unused) from a maximum network size. In calculating the description length of the neural network model it is then necessary to include only the network parameters used.
3 Discussion
It has been found experimentally that a genetic optimization algorithm (Goldberg 1988) is successful in minimizing the total description length over the discrete space involved in this method. Weights and training data parameters are coded as signed 16 bit words, with each gene representing 8 bits. A population size of 500 with a low mutation probability has been found successful for a network with a maximum of fifteen hidden nodes. Results from network simulations have demonstrated that, in learning on a distribution about the XOR problem with a general recurrent network, the simplest feedforward design prevailed within 1000 generations: a multilayer perceptron with one hidden node and direct bottom-to-top connections. Simulations on line detection in 9 × 9 pixel images given 200 training examples, using a multilayer perceptron architecture, took 300,000 generations to converge to an optimal matched filter. The final network size and weight distribution produced are implicit in the quantity of and variation in the training data provided. Training by this method not only encourages weight elimination by the reduction of parameters, as in Weigend et al. (1991), but also weight sharing through the minimization of the self information of the parameters, as in Nowlan and Hinton (1992).
Acknowledgments

The work of G. D. Kendall is supported by the UK Science and Engineering Research Council and Smith System Engineering Limited.

References

Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1(1), 151-160.
Chauvin, Y. 1989. A back-propagation algorithm with optimal use of hidden units. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 519-526. Morgan Kaufmann, San Mateo, CA.
Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning architecture. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 524-532. Morgan Kaufmann, San Mateo, CA.
Goldberg, D. E. 1988. Genetic Algorithms in Search, Optimisation and Machine Learning. Addison-Wesley, Reading, MA.
Le Cun, Y., Denker, J. S., and Solla, S. A. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 598-605. Morgan Kaufmann, San Mateo, CA.
Nowlan, S. J., and Hinton, G. E. 1992. Simplifying neural networks by soft weight-sharing. Neural Comp. 4(4), 473-493.
Rissanen, J. 1989. Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1991. Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 875-882. Morgan Kaufmann, San Mateo, CA.

Received 10 April 1992; accepted 3 September 1992.
Communicated by Jack Byrne
A Neural Network Model of Inhibitory Information Processing in Aplysia Diana E. J. Blazis Thomas M. Fischer Thomas J. Carew Departments of Psychology and Biology, Yale University,
New Haven, CT 06520 USA
Recent cellular studies have revealed a novel form of inhibitory information processing in the siphon withdrawal reflex of the marine mollusc Aplysia: motorneuronal output is significantly reduced by activity-dependent potentiation of recurrent inhibition within the siphon withdrawal network (Fischer and Carew 1991, 1993). This inhibitory modulation is mediated by two types of identified interneurons, L29s and L30s. In an effort to describe and analyze this and other forms of inhibitory information processing in Aplysia, and to compare it with similar processing in other nervous systems, we have constructed a neural network model that incorporates many empirically observed features of these interneurons. The model generates important aspects of the interactions of cells L29 and L30, and with no further modification, exhibits many network-level phenomena that were not explicitly incorporated into the model.

1 Introduction
Neural Computation 5, 213-227 (1993)  © 1993 Massachusetts Institute of Technology

Recurrent inhibitory circuits are a common architectural feature of nervous systems, providing a powerful mechanism for rapid and precise control of neural output (Pompeiano 1984). Moreover, intrinsic and extrinsic modulation of recurrent inhibitory circuitry endows those networks with a high degree of flexibility and an enhanced capacity for adaptive modification. This type of modulation has been described in several systems, including spinal motor neurons (Pompeiano 1984; Fung et al. 1988) and hippocampal circuits (Miles 1991). Recent cellular studies in the marine mollusc Aplysia have identified a recurrent inhibitory circuit in the neural network underlying the siphon withdrawal reflex (SWR) (Frost et al. 1988; Hawkins and Schacher 1989; Fischer and Carew 1991, 1993). The circuit is formed between identified interneurons L29 and L30: L29 provides excitatory input to L30, which projects back on L29 with an inhibitory synapse. The output element of this circuit, cell
L29, provides substantial input to siphon motor neurons (MNs). In the L29/L30 circuit, recurrent inhibition of L29 can be modulated in at least two ways: (1) it is reduced by stimulation mimicking tail shock (Frost et al. 1988), and (2) it is increased by prolonged direct activation of either L29 or L30 (Fischer and Carew 1991, 1993). Thus, L29/L30 interactions provide an example of a recurrent inhibitory circuit that can be modified by both extrinsic and intrinsic inputs. Moreover, the L29/L30 circuit is of additional interest because it exhibits an intrinsic form of plasticity, activity-dependent potentiation of recurrent inhibition (described more fully below). This form of use-dependent regulation of recurrent inhibition endows the L29/L30 circuitry with the capacity for dynamic gain control in the SWR (Fischer and Carew 1993).

The SWR of Aplysia has been widely used as a model system for studies of learning and memory. The siphon is located in the mantle cavity and is used as an exhalent funnel for respiration (Fig. 1A). In response to tactile stimuli delivered directly to the siphon or to other sites on the body, the animal withdraws its siphon into the mantle cavity. The SWR and its circuitry show a variety of adaptive modifications (for review, see Carew and Sahley 1986). Large cells and restricted neural networks have facilitated the identification and analysis of the neural circuitry underlying this response and its modification. Interneurons L29 and L30 are located in the abdominal ganglion (a central ganglion that contains most of the circuitry underlying the SWR) (Fig. 1B). L29 interneurons (about 5 in number) were previously shown to play an important role in excitatory information processing in the SWR circuit (Hawkins et al. 1981; Frost et al. 1988; Hawkins and Schacher 1989). L29s are activated by siphon stimulation, provide substantial excitatory input to identified siphon motor neurons (MNs, e.g., the LFS MNs; Frost et al. 1988; see also Fischer and Carew 1993), and receive recurrent inhibition from L30 interneurons (about 2 in number). Recently, Fischer and Carew (1991, 1993) have shown that direct activation of a single L29 produces transient inhibition of the total reflex input to the MNs. They further showed that this inhibition occurs because L29s recruit recurrent inhibition onto themselves from L30 interneurons. The net effect of this recruitment of inhibition is that the L29 response to siphon input is significantly reduced for up to 40 sec following intracellular activation, resulting in a smaller net EPSP in the MN. The mechanism subserving this inhibition appears to be use-dependent facilitation of the inhibitory postsynaptic potential (IPSP) from L30 to L29 (Fischer and Carew 1991, 1993). We have recently developed a neural network model to describe this form of inhibitory information processing in Aplysia, both to quantitatively analyze this type of adaptive modification in the SWR and to examine the possible behavioral significance of the L29/L30 circuit in the SWR neural network. Our overall strategy is to first represent in computational form the principal features of each identified cell, and then to
Figure 1: Schematic illustration of the siphon withdrawal reflex of Aplysia examined at several levels of analysis. (a) Diagram of intact Aplysia, illustrating the position of the siphon (adapted from Kandel 1979). (b) Diagram of a reduced preparation used to analyze neuronal correlates of behavior, showing approximate locations of interneurons and motorneurons described in the present study. Reflex input can be elicited with a tactile stimulus to the siphon while simultaneously recording from identified elements in the circuit for siphon withdrawal. (c) Connectivity of the neural network model. The asterisk denotes a synapse that exhibits use-dependent potentiation.
progressively refine the model by continually adding empirically derived cellular parameters. Thus, a number of important cellular parameters and interactive mechanisms are not yet fully incorporated in the model. [Some of these features have been modeled by other investigators: sensorimotor facilitation and inhibition have been analyzed by Gingrich and Byrne (1985) and Byrne et al. (1989), and interneuronal contributions to MN response duration have been studied by Frost et al. (1991).] Nonetheless, our model describes key features of the relationship between L29 and L30 and generates network-level phenomena of the SWR, including activity-dependent potentiation of recurrent inhibition.
2 Methods
Biophysical and synaptic parameters in the model are derived or estimated from cellular studies in Aplysia and other invertebrates. Parameters of the model are listed in Table 1, and equations for selected features of the model are given below. We have implemented our model with the network simulator GENESIS (Wilson et al. 1989), using an exponential Euler integration method and a time step of 0.1 msec.
2.1 Network Architecture. The network consists of a sensory module and four single-compartment cells: three interneurons and a motor neuron (Fig. 1C). The sensory module consists of four receptors that mimic siphon sensory input. The number of sensory receptors selected was arbitrary. In Aplysia, sensory input from the siphon accelerates rapidly and then declines over the course of continued stimulation. The mechanisms underlying this accommodation are not yet incorporated into the model; instead, receptor firing rates are set so as to decline over stimulus presentation. Another feature of sensory neurons is synaptic depression, which occurs with repeated sensory neuron stimulation at relatively short interstimulus intervals (ISIs; Castelluci et al. 1970; Byrne et al. 1989). In our empirical studies, siphon stimulation was presented at an ISI of 5 min, an interval that precludes habituation at both behavioral and synaptic levels. Therefore, the current model does not incorporate a representation of synaptic depression. The MN has only three synaptic inputs, one from the sensory module, one from a generic excitatory interneuron (denoted as E in Fig. 1C), and one from a single L29. Three features of the biological circuit are incorporated in the network: (1) L29 makes an excitatory synapse onto L30; (2) L30, in turn, projects back upon L29 with an inhibitory synapse that shows marked use-dependent potentiation (indicated by an asterisk); and (3) there is a rectifying electrical synapse between L29 and L30.
Inhibitory Information Processing in Aplysia
Table 1: Parameters of the Model.

Parameter        L29      L30    E      MN
Cm (nF)          0.63     0.63   0.63   0.63
Em (mV)          -61      -37    -40    -45
Rin              12       12     12     12
G_Na             1.7      1.98   3.14   3.14
E_Na (mV)        +55      +55    +55    +55
G_K              0.0025   0.61   0.28   0.61
E_K (mV)         -65      -60    -60    -60
G_K2             2.85     -      -      -
E_K2 (mV)        -65      -      -      -

(Leak conductances, sensory-input and output synaptic conductances with their time constants tau1 and tau2, the electrical-synapse coupling coefficients, and the saturation bound in nA complete the table; see footnotes b and c and equations 1-4 in the text.)

a: Hyperpolarized to -70 mV in all simulations.
b: Conductance range, computed as a function of L30 activation frequency and time; see text equation 3.
c: Current across electrical synapse: see equations 4a and 4b and accompanying text for explanation of terms.
2.2 Synaptic Conductances. The model incorporates time-dependent synaptic conductances. Ionic current resulting from activation of a synaptic conductance is calculated as follows:

I = G(t) (E_ion - E_mem)    (1)

where I = current, E_ion = reversal potential of the ion, and E_mem = resting potential of the cell. G(t) is computed with the dual exponential form:
D. E. J. Blazis, T. M. Fischer, and T. J. Carew
G(t) = w (e^(-t/tau1) - e^(-t/tau2))    (2)

where w is a weighting factor that is free to vary. For the EPSP and IPSP between L29 and L30 (respectively), the reversal potential and time constants of the simulated conductances reflect those empirically observed. The magnitudes of synaptic contributions from each of the three pathways converging onto the MN were specified in the following way. First, the strength of the sensory input was set to result in peak complex EPSP amplitude of 5-10 mV (in all cases, the MN was hyperpolarized to -70 mV, as in empirical studies). The strength of the L29 EPSP onto the MN was set to produce a depolarization of about 10 mV. The strength of the sensory input to L29 was set to yield an appropriate firing pattern in response to a short stimulus (that is, 4-6 action potentials to a 100 msec activation of the sensory receptors). Since the complex EPSP seen in the MN in response to sensory stimulation typically has a peak amplitude of 25-30 mV (Fischer and Carew 1992), the strength of the connection from interneuron E was then set to make up the difference between that amount and the contribution of the sensory array and the L29 input. A look-up function describing activity-dependent potentiation of the L30 IPSP has been incorporated in the model. This preliminary representation of potentiation of the L30 IPSP is based on a linear regression of data obtained by Fischer and Carew (1991, 1992) and specifies that the inhibitory synaptic conductance from L30 onto L29 changes as a function both of L30 activation and of time:
w(t) = k (0.1x + 0.591y + 1)    (3)
where x = spike count of L30 during activation, y = time after stimulation offset, and k = 3400 (a value which yields approximately a 1 mV IPSP in L29 at baseline). In the simulations presented here, activation of L30 by sensory stimulation is not taken into account. Thus the model slightly underestimates the weight of the inhibitory connection from L30 to L29.

2.3 Active Conductances. Active conductances are based on the Hodgkin and Huxley (1952) formulations. At present, each cell has a sodium conductance (G_Na), a delayed rectifier potassium conductance that exhibits inactivation (G_K), and a leak conductance (G_L); the implementations of these conductances were drawn from existing GENESIS version 1.3 libraries. L29 has an additional, noninactivating potassium current (derived from Byrne 1980). Parameters for the various conductances were set to approximate the firing patterns of each cell. In more detailed versions of the model, other conductances will be integrated as well (e.g., the Ca-dependent potassium conductance described for cell L30; Frost et al. 1988).
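The time-dependent synaptic conductances of Section 2.2 can be sketched in a few lines: equation 1 multiplies a conductance by the driving force, and equation 2 gives the conductance its dual exponential time course. The weight, time constants, and reversal potential below are illustrative values, not the Table 1 entries:

```python
import math

def g_syn(t, w, tau1, tau2):
    """Dual exponential synaptic conductance G(t) (equation 2 form).

    w is the free weighting factor; tau1 and tau2 shape the decay and
    rise of the conductance transient. t is time since activation.
    """
    if t < 0.0:
        return 0.0
    return w * (math.exp(-t / tau1) - math.exp(-t / tau2))

def i_syn(t, v_mem, e_rev, w, tau1, tau2):
    """Synaptic current: conductance times driving force (equation 1 form)."""
    return g_syn(t, w, tau1, tau2) * (e_rev - v_mem)

# Illustrative IPSP-like parameters: a reversal potential below rest
# yields a hyperpolarizing (negative) current at the resting potential
i = i_syn(t=5.0, v_mem=-61.0, e_rev=-75.0, w=1.0, tau1=9.0, tau2=1.0)
```

With e_rev - v_mem = -14 mV, the current is negative (inhibitory) for all t > 0, as expected for an IPSP such as that from L30 onto L29.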
2.4 Electrical Synapse. The model includes a representation of the electrical synapse between L29 and L30. The current through the electrical synapse at a given time is proportional to the difference between the potentials of cells L29 and L30:

I_(29->30) = r1 (V_L29 - V_L30)    (4a)
I_(30->29) = r2 (V_L30 - V_L29)    (4b)
where r1 and r2 (values in Table 1) are coupling coefficients for the simulated electrical synapse between cells L29 and L30 that reflect the results of empirical studies (Fischer and Carew 1993). Like the biological synapse, the simulated electrical synapse is rectifying, such that current passes more readily from L29 to L30 than in the reverse direction. The current from one cell to another is also bounded ("saturation" term in Table 1) to reflect the fact that large voltage trajectories, such as those due to action potentials, are severely attenuated by the electrical synapse. Taken collectively, the above assumptions allow the model to generate salient features of the L29-L30 interaction. In the sections that follow, we compare our empirical results with our simulations of a variety of physiological manipulations of this simple network.

3 Results

3.1 Characteristic L29/L30 Interactions. Figure 2 shows that the model produces key features of the firing pattern of interneuron L29. The left column shows empirically obtained results (Fischer and Carew 1991, 1993), and the right column shows the simulations. Traces A1 through D1 of Figure 2 show characteristic features of cell L29: (1) L29 responds to siphon input with a brief burst of action potentials (Fig. 2A1). (2) L29 exhibits a characteristic response to maintained intracellular injection of depolarizing current, responding initially with a brisk burst of action potentials that becomes arrhythmic (Fig. 2B1). This "stuttering" occurs because L29 activates L30, thereby recruiting inhibition onto itself. (3) Subthreshold depolarization of L29 recruits IPSPs back onto itself (Fig. 2C1) by virtue of current flow to L30 through the electrical synapse (L30 has a lower voltage threshold for action potential initiation than L29). (4) L29 receives an IPSP from L30 that is potentiated by repetitive firing of L30 (Fig. 2D1). In the experiment shown in Figure 2D, cell L29 has been hyperpolarized to a membrane potential more negative than the reversal potential of the L30 IPSP, resulting in a depolarizing synaptic response. A single action potential from cell L30 (indicated by the arrow) results in a small IPSP in L29 (PRE). When cell L29 is then repeatedly activated, thereby activating L30 (not shown), dramatic use-dependent facilitation results in a 2- to 3-fold increase in the L30 IPSP amplitude (POST). Our simulations of the L29-L30 interactions are qualitatively quite similar to the empirical data (Fig. 2, A2-D2), but several aspects of the
Figure 2: Comparison of empirical results and simulations showing characteristic features of the firing pattern of cell L29 and interactions between cells L29 and L30. (A1, A2) Response of cell L29 to brief tactile stimulation. (B1, B2) Response of cell L29 to intracellular current injection. (C1, C2) Recruitment of IPSPs onto L29 from L30 resulting from subthreshold current injection into L29. (D1, D2) Use-dependent potentiation of the IPSP from L30 to L29 (the IPSP is depolarizing due to hyperpolarization [HYP] of L29; see text). In both the empirical and simulated experiments, L29 was activated for 5 sec by current sufficient to produce a firing rate of 30 Hz (approximately 3-4 nA), which generates sufficient activation of L30 (not shown) to potentiate the L30-L29 IPSP.
simulations warrant comment. First, the frequency of firing to siphon input is somewhat lower for the simulated L29 (Fig. 2A2) than that empirically observed. Second, the simulated firing frequency of L29 to current injection does not completely match the empirical data (Fig. 2B2). Third, there is a reasonable match of the empirical data to simulations of subthreshold recruitment of inhibition onto L29 (Fig. 2C2) and facilitation of the IPSP from L30 to L29 (Fig. 2D2).
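The rectifying, saturating electrical synapse of Section 2.4 can be sketched as follows. This is a hedged reading of equations 4a and 4b; the coupling coefficients and saturation bound below are placeholders, not the Table 1 values:

```python
def electrical_synapse_current(v_pre, v_post, r_forward, r_backward, i_sat):
    """Current into the postsynaptic cell through a rectifying gap junction.

    Proportional to the potential difference (equations 4a/4b form),
    rectifying (r_forward > r_backward lets current pass more readily
    from L29 to L30 than in the reverse direction), and clipped at
    +/- i_sat so that large voltage trajectories such as action
    potentials are severely attenuated.
    """
    dv = v_pre - v_post
    r = r_forward if dv > 0.0 else r_backward
    return max(-i_sat, min(i_sat, r * dv))

# Illustrative: an action-potential-sized voltage difference is clipped
# at the saturation bound, while a small subthreshold drive passes in
# proportion to the potential difference
i_spike = electrical_synapse_current(35.0, -65.0, 0.01, 0.001, 0.55)
i_small = electrical_synapse_current(-55.0, -65.0, 0.01, 0.001, 0.55)
```

Here i_spike is pinned at the saturation value, while i_small scales linearly with the 10 mV difference, mimicking the severe attenuation of spikes across the biological junction.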
In summary, the model produces essential features of the interaction between L29 and L30. In addition, as will be discussed below, with no further changes, the model also generates important aspects of the synaptic input to the MNs as well as inhibitory modulation of that input.

3.2 Reduction of MN Output by Voltage Clamp of L29. Empirical results shown in Figure 3 (upper half) illustrate the effect on the complex EPSP in a siphon MN of voltage clamping a single L29 during siphon stimulation. On the left, synaptic input to the MN and L29 was elicited by tactile stimulation of the siphon (see Fig. 1B). On the right are shown the responses of these same cells to siphon tap when a single L29 is functionally removed from the circuit by voltage clamping it at its resting potential (approximately -60 mV). In the example shown in Figure 3, the evoked complex EPSP in the MN is diminished to 50% of control. The simulated data (Fig. 3, lower half) capture this result qualitatively, although on average the simulated complex EPSP is shorter in duration than the empirically measured EPSP (note differences in gain and time base in Fig. 3).

3.3 Inhibition of MN Output by Intracellular Activation of L29. As described above, Fischer and Carew (1991, 1993) found that intracellular activation of L29 results in significant inhibition of the tap-evoked complex EPSP in the MN for about 40 sec following L29 activation. In the example shown in Figure 4A1, activation of the SWR network by a siphon tap resulted in a complex EPSP of about 25 mV in the MN and a brisk burst of action potentials in L29. A single L29 was then activated with intracellular current for 5 sec (not shown). Twenty seconds after L29 activation, the complex EPSP elicited by an identical siphon tap was significantly reduced (to about 5 mV), as was the response of L29 itself. Both the EPSP and the response of L29 to siphon tap recovered 5 min later.
As shown in Figure 4A2, our simulations of activity-dependent potentiation of recurrent inhibition induced by activation of L29 are qualitatively similar to the empirical results. Specifically, 20 sec after L29 activation (and the consequent activity-dependent potentiation of the L30 IPSP onto L29) the tap-evoked complex EPSP is reduced (although the magnitude of the inhibitory effect is smaller than that empirically observed, see Discussion), and the response of L29 itself is also markedly reduced. Both the complex EPSP and the response of L29 to siphon tap recovered 5 min later. The mechanism subserving the inhibitory effects of L29 appears to be that it recruits inhibition onto itself from L30. Figure 4B1 is an empirical record showing that injection of depolarizing current into L29 activates L30 through both chemical and electrical synapses (Fig. 1C). This empirical result is also produced by the model (Fig. 4B2). That L29 activation potentiates the L30 IPSP implies that direct activation of L30 should also produce inhibition of the complex EPSP at the MN; this is indeed the
case (Fischer and Carew 1991, 1993). Moreover, the magnitude and time course of IPSP potentiation by activation of L29 (Fig. 2D1) map onto the inhibition of the tap-evoked response of the MN (Fischer and Carew 1991, 1993).
4 Discussion
We have presented a preliminary computational analysis of a neural network that exhibits activity-dependent potentiation of recurrent inhibition. We are encouraged that the model captures some of the key features of the L29-L30 interaction and, without further modification, exhibits some of the empirically observed circuit interactions, including inhibitory modulation, that were not explicitly incorporated into the model.

4.1 Discrepancies between Empirical Results and Simulations. The simulations of activation (through current injection) and inactivation (through voltage clamp) of L29 (Figs. 3 and 4) show that the model describes the empirical results qualitatively but, at least in some cases, not quantitatively. First, the EPSPs in the simulated MN are of shorter duration than those observed empirically; and in the case of activation of L29, the magnitude of inhibition observed is somewhat smaller than that shown in Figure 4A (although the magnitude of the simulated inhibition is quite close to the average L29-induced inhibition of 20% observed by Fischer and Carew 1993). The discrepancies between simulated and empirical results are likely due to the fact that the current model incorporates only a single L29 in the circuit, whereas there are at least 5 L29 cells in the actual reflex circuit. Most of the L29s appear to contribute to the EPSP recorded in the MN (Frost et al. 1988; Hawkins and Schacher 1989; Fischer and Carew 1993), and thus our model underestimates the summed contribution of the entire L29 class. Ongoing simulations incorporating multiple L29s can directly test this hypothesis (Blazis et al. 1992) and, in addition, may shed light on the unique contribution of a single L29 to various features of network behavior, a determination that would be difficult (if not impossible) to achieve in cellular experiments alone.
Related questions of interest in this computational analysis will focus on the functional significance of particular architectural and synaptic features of the network, such as the role of multiple L29s and L30s, the contribution of the rectifying electrical synapse between L29 and L30, and the dynamics of facilitation of L30 synapses under different activation conditions.
Figure 3: Facing page. Comparison of empirical results and simulations showing the response of L29 and a siphon MN to tactile stimulation before and during voltage clamp of L29. For both the empirical and simulated results, left-hand traces show the response of the MN and a single L29 to a siphon tap (arrow) before voltage clamp. Right-hand traces show the response of the MN and L29 when L29 is voltage clamped at its resting potential. In this and subsequent figures, the MN is hyperpolarized to reveal underlying EPSPs.
4.2 Role of the L29/L30 Circuit in Information Processing in the SWR. Our current cellular and computational analyses are aimed at determining the functional relationships between activity-dependent potentiation of recurrent inhibition and other forms of plasticity observed in
the SWR. For example, since the L30 synapse potentiates at relatively low rates of activation (Fischer and Carew 1993), one form of plasticity that the L29-L30 circuit could theoretically mediate is habituation (habituation is known to involve reduction of afferent input to MNs with repeated sensory stimulation; see Castelluci et al. 1970). Build-up of use-dependent potentiation of the L30 IPSP, achieved via repeated siphon stimulation at short ISIs known to produce habituation, could progressively remove the contribution of L29 interneurons to siphon MN output. Indeed, our ongoing simulations suggest that the L29/L30 circuit alone can mediate at least partial habituation of reflex output, depending on stimulus duration and ISI (Blazis et al. 1992). Thus, inhibition created by the L29/L30 recurrent inhibitory circuit could augment other mechanisms, such as homosynaptic depression, thought to subserve habituation in Aplysia (Castelluci et al. 1970). In addition to a role in nonassociative learning, L29 and L30 might also contribute to associative processes. For example, as described by Fischer and Carew (1993), this interneuronal network could play a role in changes in response topography associated with classical conditioning of the SWR (Hawkins et al. 1989; Walters 1989). In conclusion, cellular and computational analyses of simple nervous systems like that of Aplysia can yield insights that are useful for understanding both natural and artificial intelligent systems (see Hawkins and Kandel 1984; Card and Moore 1990). To date, most studies of the SWR of Aplysia have focused on a single synapse, that between the siphon sensory neurons and motor neurons (for review, see Carew and Sahley 1986). However, several studies (Frost et al. 1988; Hawkins and Schacher 1989; Fischer and Carew 1991, 1992) as well as the present work show that the SWR network also contains functionally important recurrent inhibitory circuits that can be modulated by both extrinsic
Figure 4: Facing page. Comparison of empirical results and simulations showing inhibition of the complex EPSP in the MN and recruitment of L30 firing following intracellular activation of L29. (A1, A2) Empirical result (A1) and simulation (A2) of L29-induced inhibition of the complex EPSP in the MN. For each pair of traces, the top trace shows the response of the MN and the lower trace shows the response of L29. 0.00 min: response of MN and L29 to siphon tap (arrow). 5.00 min: response of MN and L29 to siphon tap 20 sec following intracellular activation of a single L29 (3-4 nA, 5 sec, as in Fig. 2D). The diminished EPSP amplitude of the MN and reduced responding of L29 occur because activation of L29 recruits inhibition from L30 onto L29, effectively removing L29 from the circuit (see text). 10.00 min: response of the MN and L29 to siphon tap 5 minutes after L29 activation. (B1, B2) Empirical result (B1) and simulation (B2) of activation of L30 by intracellular current injection into L29. Note that, in response to maintained depolarization of L29, L30 continues to fire action potentials (even after L29 has been silenced by recurrent inhibition) due to the electrical synapse between L29 and L30 (see Fig. 2B,C).
inputs and intrinsic activity. The ability of cellular studies in Aplysia to provide critical physiological constraints for a biologically realistic model greatly facilitates a computational analysis. In turn, a computational analysis can provide general insights into the principles of operation underlying different forms of information processing in an identified neural network.

Acknowledgments
We thank Kent Fitzgerald, Edward Kairiss, and Emilie Marcus, and two anonymous reviewers for valuable comments on an earlier version of the manuscript. We are also grateful to David Berkowitz and Edward Kairiss for many helpful discussions. This research was supported by National Research Institute Service Award 1F32-MH10134-02 to D. E. J. B., Grant PHS T32-MH18397-05 to T. M. F., and AFOSR Award AF 89-0362 to T. J. C.

References

Blazis, D. E. J., Fischer, T. M., and Carew, T. J. 1992. A neural network model of use-dependent gain control in the siphon withdrawal reflex of Aplysia. Soc. Neurosci. Abstr. 18, 713.
Byrne, J. H. 1980. Quantitative aspects of ionic conductance mechanisms contributing to firing pattern of motor cells mediating inking behavior in Aplysia californica. J. Neurophysiol. 43, 651-668.
Byrne, J. H., Gingrich, K. J., and Baxter, D. A. 1989. Computational capabilities of single neurons: Relationship to simple forms of associative and nonassociative learning in Aplysia. In Computational Models of Learning in Simple Neural Systems, R. D. Hawkins and G. H. Bower, eds., pp. 31-63. Academic Press, San Diego.
Card, H. C., and Moore, W. R. 1990. Silicon models of associative learning in Aplysia. Neural Networks 3, 333-346.
Carew, T. J., and Sahley, C. L. 1986. Invertebrate learning and memory: From behavior to molecules. Annu. Rev. Neurosci. 9, 435-487.
Castelluci, V., Pinsker, H., Kupfermann, I., and Kandel, E. R. 1970. Neuronal mechanisms of habituation and dishabituation of the gill-withdrawal reflex of Aplysia. Science 167, 1745-1748.
Fischer, T. M., and Carew, T. J. 1991. Activation of the facilitatory interneuron L29 produces inhibition of reflex input to siphon motor neurons in Aplysia. Soc. Neurosci. Abstr. 17, 1302.
Fischer, T. M., and Carew, T. J. 1993. Activity-dependent potentiation of recurrent inhibition: A mechanism for dynamic gain control in the siphon withdrawal reflex of Aplysia. J. Neurosci., in press.
Frost, W. N., Clark, G. A., and Kandel, E. R. 1988. Parallel processing of short-term memory for sensitization in Aplysia. J. Neurobiol. 19, 297-334.
Frost, W. N., Wu, L. G., and Lieb, J. 1991. Simulation of the Aplysia siphon withdrawal circuit: Slow components of interneuronal synapses contribute to the mediation of reflex duration. Soc. Neurosci. Abstr. 17, 1390.
Fung, S. J., Pompeiano, O., and Barnes, C. D. 1988. Coerulospinal influence on recurrent inhibition of spinal motonuclei innervating antagonistic hindleg muscles of the cat. Pflügers Arch. 412, 346-353.
Gingrich, K. J., and Byrne, J. H. 1985. Simulation of synaptic depression, posttetanic potentiation, and presynaptic facilitation of synaptic potentials from sensory neurons mediating gill-withdrawal reflex in Aplysia. J. Neurophysiol. 53, 652-669.
Hawkins, R. D., and Schacher, S. 1989. Identified facilitator neurons L29 and L28 are excited by cutaneous stimuli used in dishabituation, sensitization, and classical conditioning of Aplysia. J. Neurosci. 9, 4236-4245.
Hawkins, R. D., and Kandel, E. R. 1984. Is there a cell biological alphabet for simple forms of learning? Psychol. Rev. 91, 375-391.
Hawkins, R. D., Castelluci, V. F., and Kandel, E. R. 1981. Interneurons involved in mediation and modulation of gill-withdrawal reflex in Aplysia. I. Identification and characterization. J. Neurophysiol. 45, 304-314.
Hawkins, R. D., Lalevic, N., Clark, G. A., and Kandel, E. R. 1989. Classical conditioning of the Aplysia siphon-withdrawal reflex exhibits response specificity. Proc. Natl. Acad. Sci. U.S.A. 86, 7620-7624.
Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London) 117, 500-544.
Kandel, E. R. 1979. Behavioral Biology of Aplysia: A Contribution to the Comparative Study of Opisthobranch Molluscs. W. H. Freeman, San Francisco.
Miles, R. 1991. Tetanic stimuli induce a short-term enhancement of recurrent inhibition in the CA3 region of guinea-pig hippocampus in vitro. J. Physiol. 443, 669-682.
Pompeiano, O. 1984. Recurrent inhibition. In Handbook of the Spinal Cord, R. A. Davidoff, ed., pp. 461-557. Marcel Dekker, New York.
Walters, E. T. 1989. Transformation of siphon responses during conditioning of Aplysia suggests a model of primitive stimulus-response association. Proc. Natl. Acad. Sci. U.S.A. 86, 7616-7619.
Wilson, M. A., Bhalla, U. S., Uhley, J. D., and Bower, J. M. 1989. GENESIS: A system for simulating neural networks. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 485-492. Morgan Kaufmann, San Mateo, CA.

Received 27 February 1992; accepted 21 August 1992.
Communicated by Gordon Shepherd
Computational Diversity in a Formal Model of the Insect Olfactory Macroglomerulus

C. Linster, École Supérieure de Physique et de Chimie Industrielles de la Ville de Paris, Laboratoire d'Électronique, 10, rue Vauquelin, 75005 Paris, France
C. Masson, Laboratoire de Neurobiologie Comparée des Invertébrés, INRA/CNRS (URA 1190), 91440 Bures-sur-Yvette, France
M. Kerszberg, Biologie Moléculaire, CNRS (URA 1284), Institut Pasteur, 25, rue du Docteur Roux, 75015 Paris, France
L. Personnaz and G. Dreyfus, École Supérieure de Physique et de Chimie Industrielles de la Ville de Paris, Laboratoire d'Électronique, 10, rue Vauquelin, 75005 Paris, France
We present a model of the specialist olfactory system of selected moth species and the cockroach. The model is built in a semirandom fashion, constrained by biological (physiological and anatomical) data. We propose a classification of the response patterns of individual neurons, based on the temporal aspects of the observed responses. Among the observations made in our simulations, a number relate to data about olfactory information processing reported in the literature; others may serve as predictions and as guidelines for further investigations. We discuss the effect of the stochastic parameters of the model on the observed model behavior and on the ability of the model to extract features of the input stimulation. We conclude that a formal network, built with random connectivity, can suffice to reproduce and to explain many aspects of olfactory information processing at the first level of the specialist olfactory system of insects.

1 Introduction
Neural Computation 5, 228-241 (1993) © 1993 Massachusetts Institute of Technology

We study the detection of sexual pheromones by insects, with a view to the more general modeling of the olfactory pathway. We use the known anatomical data of the olfactory system, and retain the level of detail we deem necessary to produce biologically relevant behavior. Thus, we do
not attempt to represent the particulars of dendritic passive propagation, or the precise local input-output functionality of individual synapses. We model them by simple ingredients such as propagation delays and activation thresholds. Precise biological data on the wiring are not available; we therefore introduce randomness in the connectivity. A variety of approaches to the modeling of olfactory systems have been presented thus far: the pioneering work of Rall and Shepherd (1968) exploits a wealth of detail concerning the precise shape of mitral cell dendrites in order to compute electrical potentials; Wilson and Bower (1988, 1989) and Haberly and Bower (1989) replicate certain basic features of responses by extensive simulations of larger cell sets exploiting the same data; Lynch and Granger (Lynch et al. 1989; Lynch and Granger 1989) study associative memory and synaptic adaptation in piriform cortex, including considerable detail about synaptic processes, and a Hebb-type learning rule; Li and Hopfield (1989) attempt to abstract a set of relevant parameters from the biological details of the olfactory modular organization, with a highly simplified model: interneurons are lumped into single variables. In contrast, we study the individual and collective behavior of neurons whose dendrites make contacts within the so-called macroglomerulus (or macroglomerular complex, MGC), which is responsible for sexual pheromone recognition. The aim of our work is to analyze the emergence of the responses necessary for odor recognition and localization.

2 Biological Background
In the olfactory system of insects, two subsystems process behaviorally important odor classes: the specialist subsystem detects sexual pheromones, while the generalist subsystem recognizes food odors (for a review see Masson and Mustaparta 1990). In the following, we focus on the specialist subsystem. It receives information from sensory neurons, which are sensitive to non-overlapping molecule spectra ("labeled lines"). The axons of sensory neurons project onto the antennal lobe local interneurons, which possess no axons, and onto the antennal lobe projection or output neurons. The latter transfer signals to other centers for further integration with other sensory modalities. The huge convergence between pheromone-sensitive and projection neurons, which Ernst and Boeckh (1983) estimate as 5000:1 in the cockroach, leads to a characteristic spatial organization of all synaptic connections in subassemblies termed glomeruli, which are identifiable and species-specific. In the case of interest to us (e.g., in certain moth species and in the cockroach), this reduces to a single MGC (Fig. 1). We use data pertaining to the moth species Manduca sexta and to the cockroach Periplaneta americana (for reviews see Christensen and Hildebrand 1987a; Boeckh and Ernst 1987).

Figure 1: Schematic representation of the specialist olfactory system. In the macroglomerulus, receptor cell axons connect with local interneurons (restricted to the antennal lobe), and with projection neurons, which convey information to higher centers.

The complex responses to stimulation by pheromone blends, as observed intracellularly in projection neurons, indicate that integrative processes take place in the MGC. In the moth, the depolarization of a local interneuron can cause inhibition of background activity in a projection neuron. There is also evidence that local interneurons are responsible for much or all of the inhibitory synaptic activity (Christensen and Hildebrand 1987b). Furthermore, the long-latency excitation exhibited by some projection neurons suggests that polysynaptic pathways are present between pheromone-responsive primary afferent axons and the projection neurons. In fact, it has been demonstrated, in the cockroach, that the receptor axons synapse mainly with local interneurons (Boeckh et al. 1989; Distler 1990).

3 The Formal Model
Neurons may be at rest (x = 0) or above firing threshold (x = 1). They are probabilistic neurons with memory: the probability P[x_i(t) = 1] that the state x_i(t) of neuron i at time t is 1 is given by a sigmoid function of the neuron membrane potential v_i(t) at time t:

P[x_i(t) = 1] = 1 / (1 + exp(-(v_i(t) - θ_i)/T))

which is biased by a positive threshold θ_i, and where T is a parameter, called temperature, which determines the amount of noise in the network (random fluctuations of the membrane potential). In discrete time, the fluctuation of the membrane potential around the resting potential, due to input e_i(t) at its postsynaptic sites, is expressed as

v_i(t) = v_i(t - Δt) - (Δt/τ_i) v_i(t - Δt) + e_i(t - Δt)

where τ_i is the membrane time constant and Δt is the sampling interval, with

e_i(t) = Σ_j w_ij x_j(t - τ_ij)

where w_ij is the weight of the synapse between neuron j and neuron i, and τ_ij is its delay. The weights are binary. The value of the transmission delay associated with each synapse is fixed but chosen randomly; it is meant to model all sources of delay, transduction, and deformation of the transmitted signal from the cell body or dendrodendritic terminal of neuron j to the receptor site of neuron i. The mean value of the delay distribution is longer for inhibition than for excitation: we thereby take into account approximately the fact that IPSCs usually have slower decay than EPSCs, and may accumulate to act later than actually applied. We consider three types of neurons: receptor, inhibitory, and excitatory. Two types of receptor neurons (A and B) are sensitive only to input A or B, where A and B represent two odor components. For all A (respectively B) type receptor neurons, we have e_i(t) = A(t) [respectively B(t)], where A(t) is the concentration of component A. Receptor neurons may make axodendritic (τ_ij = 0), excitatory synapses with both types of interneurons. Interneurons may make dendrodendritic synapses (τ_ij ≠ 0) with any other interneuron, but the connectivity c will be sparse.
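A single model neuron of this kind can be sketched in a few lines. This is a hedged sketch under our reading of the update rule: the class and parameter names are ours, the delayed-sum bookkeeping of a full network is omitted, and the summed input e is supplied directly; the temperature, threshold, time constant, and sampling step follow the interneuron values given in the footnote to the Results section.

```python
import math
import random

class ProbabilisticNeuron:
    """Binary-state neuron with a leaky membrane potential.

    Firing probability is a sigmoid of the membrane potential v,
    biased by the threshold theta, with temperature T setting the
    amount of noise; v integrates inputs with time constant tau.
    """

    def __init__(self, tau, theta, T, dt=5.0, rng=None):
        self.tau, self.theta, self.T, self.dt = tau, theta, T, dt
        self.v = 0.0
        self.x = 0
        self.rng = rng or random.Random()

    def fire_probability(self):
        return 1.0 / (1.0 + math.exp(-(self.v - self.theta) / self.T))

    def step(self, e):
        """e = summed delayed input, sum_j w_ij * x_j(t - tau_ij)."""
        self.v += -(self.dt / self.tau) * self.v + e  # leaky integration
        self.x = 1 if self.rng.random() < self.fire_probability() else 0
        return self.x

# Sustained input drives the neuron to fire on almost every step;
# removing the input lets v decay and firing falls to the noise floor.
neuron = ProbabilisticNeuron(tau=25.0, theta=1.5, T=0.375,
                             rng=random.Random(0))
spikes_on = sum(neuron.step(e=1.0) for _ in range(200))
spikes_off = sum(neuron.step(e=0.0) for _ in range(200))
```

With these parameters the driven membrane potential settles well above threshold, so the firing probability saturates near 1; after the input is removed, the residual noise-floor probability is roughly sigmoid(-θ/T) per step.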
4 Results
To analyze the behavior of such a network, we first introduce a classification of the possible response patterns of the neurons, which has been found useful for the analysis of olfactory response patterns (Meredith
1986; Kauer 1974; Fonta et al. 1991). In the network under investigation,¹ which exhibits a typical distribution of response patterns, we observe three classes of patterns: purely excitatory, purely inhibitory, and mixed (both inhibitory and excitatory) responses. Excitation and inhibition are defined in relation to the neuron's spontaneous activity. The mixed response patterns subdivide into three groups, according to the relative durations of the inhibition and excitation phases (Fig. 2A and 2B). We analyze the behavior of the network in response to four characteristics of the input patterns (pure odors, A or B, and mixed odors, A and B), which are behaviorally important (see Kaissling and Kramer 1990): (1) amplitude, (2) stimulus shape, (3) frequency of stimulus presentation, and (4) ratio of the components in mixed odors. The behavior of the model network exhibits several characteristics that agree with biological data: selective neurons respond to only one of the two odor components and nonselective neurons respond to both components. The neurons exhibit a limited number of response patterns, most of them a combination of excitation and inhibition (Fig. 3A and 3B). The recognition of the concentration ratio of odor components is of behavioral importance, but it is not known whether the detection of a precise ratio is achieved at the level of the glomerulus or at higher olfactory centers. Here, we observe amplitude and temporal variations of the response patterns of individual interneurons as a function of the concentration ratio. Interneurons with oscillatory responses code, by temporal changes in their response patterns (Fig. 4A), for ratio variations of the input stimulation. In addition, pairs of neurons respond simultaneously to mixed input of a specific input ratio; in contrast, the first spikes of the responses to other ratios are separated by 25-50 msec; thus, the response latency could be one of the response parameters that indicate ratio detection (Fig.
5). The odor plume formed downwind from the calling female possesses a highly variable structure. Pulsed stimulation improves a male moth's ability to orient toward an odor source (Baker et al. 1985; Kennedy 1983). We have therefore observed the behavior of the interneurons in response to pulsed stimulation. We find that some interneurons cannot follow pulsed stimulation beyond a specific cut-off frequency (Fig. 4B). The ability of these neurons to detect a certain frequency range depends on their response pattern; the cut-off frequency of each neuron depends on
¹Fifty neurons; connectivity c = 10% (190 synapses); synaptic strength w_ij = +1/−1; 30% receptor neurons, 30% excitatory interneurons, 40% inhibitory interneurons; sampling step Δt = 5 msec, which is enough to study the maximal physiological spiking frequencies (Christensen et al. 1989a); membrane time constant τ = 25 msec; synaptic delays are chosen from a uniform distribution between 10 and 60 msec for excitatory synapses, and between 10 and 100 msec for inhibitory synapses; the parameters of the sigmoids are T = 1 and θ = 1 for receptor neurons, T = 0.375 and θ = 1.5 for the others.
Model of the Insect Olfactory Macroglomerulus
233
Figure 2: (A) Response patterns: the amount of activation and inactivation is shown as a function of the stimulation (here Δt = 5 msec). R1, Activation for the duration of the stimulation; the spiking frequency varies as a function of the amplitude of the input. R2, The activation is followed by an inactive phase after the end of the stimulation. R3, Phasic burst, followed by an inactive phase of the same duration as the stimulus. R4, Phasic burst, followed by a tonic phase of diminished activation or by a phase of nonresponse, and by a short inactive phase after the end of the stimulation. R5, Phasic burst, followed by several phases of inactivation and activation (oscillatory response). R6, Inactivation during the application of the stimulation; the amplitude of the negative potential is a function of the amplitude of the stimulation.
Figure 2: (B) Neurons responding with R1-R6 (duration of each stimulation: 200 msec).
the duration of the stimulation and on the interstimulus interval. Neurons that respond with mixed excitation and inhibition show irregular responses and cannot follow high-frequency stimulation. Neurons that respond with excitation mostly respond continuously to high-frequency stimulation. These behaviors depend mainly on the relations between the stimulation frequency, the interstimulus interval, and the temporal parameters of the model. Synaptic delays determine the behavior of mixed responses, while membrane time constants determine the behavior of excitatory responses. The stimulus profiles (rise and fall times of the odor signal) indicate, irrespective of the stimulus concentration, the distance between the location of odor perception and the odor source. We observe a number of interneurons that reflect the profile of the stimulation irrespective of its concentration. This again depends on the response patterns; neurons that exhibit purely excitatory responses reflect the input profile by response latency and response duration, whereas neurons that exhibit an oscillatory response have completely different temporal response patterns as a function of the input profile (Fig. 4C). In this section, we have shown that the response patterns of individual neurons reflect various characteristics of the input pattern. Selective neurons indicate the presence, amplitude, and stimulus profile of one component (depending on their response pattern); nonselective neurons indicate the presence, amplitude, and stimulus profile of the mixture of the two components. Some nonselective neurons also reflect the quality of the mixture, that is, the ratio of the components.

Figure 3: Responses of selective (A) and nonselective (B) neurons to stimulation with one and both odors (duration of each stimulation: 200 msec).

5 Influence of the Distribution of Neurons and Synapses
The number and diversity of the response patterns depend on the total number of neurons, on the distribution of excitation and inhibition in the network, on the number of connections and feedback loops, and on the temporal parameters (i.e., synaptic delays and membrane time constants). The diversity of response patterns grows with the percentage of synapses in the network (all other parameters remaining unchanged). At connectivity c < 2%, afferent synapses cause purely excitatory responses (R1); around c = 2%, simple mixed responses (R2) and inhibitory responses (R6) appear; at about c = 8%, the majority of the interneurons respond mainly with excitation (R1 and R2). The full diversity and distribution of response patterns described above are observed for most networks around c = 10%. As the number of synapses increases further, the number of response patterns decreases: network activity rises, the response patterns tend to oscillate, and the network saturates. Similarly, increasing the proportion of inhibitory synapses beyond 50% introduces oscillations, and the total activity in the network decreases. Beyond 60% inhibition, only R3 responses (phasic burst followed by a long inhibitory period) survive. If there is too much excitation in the network (more than 40% excitatory neurons or more than 40% receptor neurons), the network becomes unstable and saturates.
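The random construction varied in this section (with the parameter values given in the earlier footnote) can be sketched as follows. The wiring rule and all names are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def build_network(n=50, connectivity=0.10,
                  frac_receptor=0.3, frac_excitatory=0.3):
    """Random +1/-1 synaptic matrix: 30% receptor and 30% excitatory
    neurons project with +1, the remaining 40% inhibitory neurons with -1;
    each possible synapse exists with probability `connectivity`."""
    n_plus = int(n * (frac_receptor + frac_excitatory))
    sign = np.where(np.arange(n) < n_plus, 1.0, -1.0)
    mask = rng.random((n, n)) < connectivity
    np.fill_diagonal(mask, False)        # no self-connections (an assumption)
    return mask * sign[np.newaxis, :]    # column j carries neuron j's sign

w = build_network()
print(int(np.abs(w).sum()))  # number of synapses, close to c * n * (n - 1)
```

Sweeping `connectivity` from below 2% to above 10%, as in the text, would then vary the mix of response classes such a simulation produces.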
Figure 4: Facing page. Responses of selective and nonselective neurons with varying response patterns to stimulation with varying input characteristics. (A) Stimulation with varying ratios of the input components, the sum of the amplitudes of the two components being constant. Several neurons respond with varying temporal response patterns to changing ratios (duration of each stimulation: 50 msec). (B) Neuron 7 responds with phasic bursts to stimulation at low frequencies, and responds continually to stimulation at the same frequency but with shorter interstimulus intervals, because the interstimulus interval approaches the membrane time constant of the neuron (upper diagram: stimulation duration 30 msec, interstimulus interval 20 msec; middle diagram: stimulation duration 40 msec, interstimulus interval 10 msec; bottom diagram: stimulation duration 20 msec, interstimulus interval 10 msec). (C) Stimulation by input with varying profiles; the rise and fall times vary from 10 msec to 50 msec (stimulation duration 100 msec).
6 Discussion
In this section, we discuss the relevance of the results to the specialist system of insects. The model exhibits several behaviors that agree with biological data, and it allows us to state several predictive hypotheses about the processing of the pheromone blend. In the model, we observe two broad classes of interneurons: selective (to one odor component) and nonselective neurons. The fact that a distinct representation of pheromone components in parallel pathways
Figure 5: Importance of response latencies for ratio detection (stimulation duration 50 msec).
coming from the antenna is preserved by some antennal lobe neurons (local interneurons and projection neurons), but not all of them, has been reported in several species: in moths, Manduca sexta (Christensen and Hildebrand 1987a,b, 1989b) and Bombyx mori (Olberg 1983), and in the cockroach, Periplaneta americana (Boeckh 1976; Burrows et al. 1982; Boeckh and Selsam 1984; Hösl 1990). Selective and nonselective neurons exhibit a variety of response patterns, which fall into three classes: inhibitory, excitatory, and mixed. Such a classification has indeed been proposed for olfactory antennal lobe neurons (local interneurons and projection neurons) in the specialist olfactory system in Manduca (Christensen et al. 1989a; Christensen and Hildebrand 1987a,b). Similar observations have been reported for Bombyx mori (Olberg 1983) and for the cockroach (Burrows et al. 1982; Boeckh and Ernst 1987). In our model we observe a number of local interneurons that cannot follow pulsed stimulation beyond a neuron-specific cut-off frequency. This frequency depends on the neuron's response pattern and on the duration of the interstimulus interval. These results agree with data pertaining to antennal lobe neurons (interneurons and projection neurons) in Manduca sexta (Christensen and Hildebrand 1988) and in Heliothis virescens (Christensen et al. 1989b). In both species, some antennal lobe neurons follow pulsed input with phasic bursts up to a cut-off frequency. Physiological evidence in several species (Christensen and Hildebrand 1989b; Burrows et al. 1982) has led to the hypothesis that some projection neurons (or local interneurons) may code for pheromone concentration and quality by measuring differences in response latency and duration, instantaneous spike frequency, and total number of spikes. Furthermore,
the overall response to the correct blend of pheromones may be qualitatively different from the response to some other ratio of pheromones (Christensen and Hildebrand 1989b). Our model exhibits characteristics (Figs. 4A and 5) that could substantiate these suggestions. They will be analyzed and discussed in more detail in a forthcoming publication.

7 Conclusion
We have presented an original model of olfactory information processing in the macroglomerulus of insects. This model incorporates very simple ingredients; its connectivity is chosen randomly, from distributions that take into account complete, albeit approximate, biological knowledge. From these simple assumptions, a variety of neuronal responses emerge, some of them strongly resembling those observed in living systems. Our model performs feature extraction on the signal represented in separate input lines. A number of features concerning the single odor components as well as their blend are represented in parallel lines by the interneuron network. These results agree with the hypothesis that there are "separate but parallel lines of olfactory information flow between the antennal lobe and the protocerebrum, each line carrying information about different aspects of a pheromonal stimulus" (Christensen et al. 1989a). The use of random connectivity and synaptic delays gives us a means to study the conditions under which such feature extraction can arise, and the diversity of output patterns that are thereby exhibited. Thus, a model built with random connectivity suffices to explain, reproduce, and predict a number of signal processing properties of the olfactory specialist subsystem. The variation of random distribution parameters and delays gives insights into the means whereby natural neural nets may be modulated by higher control mechanisms, be they genetic, adaptive, or instructive.
Acknowledgments

This work has been supported in part by EEC BRAIN contract ST2J-0416C and by the Ministère de la Recherche et de la Technologie (Sciences de la Cognition).
References

Baker, T. C., Willis, M. A., Haynes, K. F., and Phelan, P. L. 1985. A pulsed cloud of sex pheromones elicits upwind flight in male moths. Physiol. Entomol. 10, 257-265.
Boeckh, J. 1976. Aspects of nervous coding of sensory quality in the olfactory pathway of insects. Proceedings of the XV International Congress of Entomology, Washington, 19-27 August 1976.
Boeckh, J., and Selsam, P. 1984. Quantitative investigation of the odor specificity of central olfactory neurons in the American cockroach. Chem. Senses 9(4), 369-380.
Boeckh, J., and Ernst, K. D. 1987. Contribution of single unit analysis in insects to an understanding of olfactory function. J. Comp. Physiol. A 161, 549-565.
Boeckh, J., Ernst, K. D., and Selsam, P. 1989. Double labelling reveals monosynaptic connections between antennal receptor cells and identified interneurons of the deutocerebrum in the American cockroach. Zool. Jb. Anat. 119, 303-312.
Burrows, M., Boeckh, J., and Esslen, J. 1982. Physiological and morphological properties of interneurons in the deutocerebrum of male cockroaches which respond to female pheromone. J. Comp. Physiol. 145, 447-457.
Christensen, T. A., and Hildebrand, J. G. 1987a. Functions, organization, and physiology of the olfactory pathways in the lepidopteran brain. In Arthropod Brain: Its Evolution, Development, Structure and Functions, A. P. Gupta, ed. John Wiley, New York.
Christensen, T. A., and Hildebrand, J. G. 1987b. Male-specific, sex pheromone-selective projection neurons in the antennal lobes of the moth Manduca sexta. J. Comp. Physiol. A 160, 553-569.
Christensen, T. A., and Hildebrand, J. G. 1988. Frequency coding by central olfactory neurons in the sphinx moth Manduca sexta. Chem. Senses 13(1), 123-130.
Christensen, T. A., Hildebrand, J. G., and Tomlinson, J. H. 1989a. Sex pheromone blend of Manduca sexta: Responses of central olfactory interneurons to antennal stimulation in male moths. Arch. Insect Biochem. Physiol. 10, 281-291.
Christensen, T. A., Mustaparta, H., and Hildebrand, J. G. 1989b. Discrimination of sex pheromone blends in the olfactory system of the moth. Chem. Senses 14(3), 463-477.
Distler, P. 1990. GABA-immunohistochemistry as a label for identifying types of local interneurons and their synaptic contacts in the antennal lobe of the American cockroach. Histochemistry 93, 617-626.
Ernst, K. D., and Boeckh, J. 1983.
A neuroanatomical study on the organization of the central antennal pathways in insects. Cell Tissue Res. 229, 1-22.
Fonta, C., Sun, X. J., and Masson, C. 1991. Cellular analysis of odour integration in the honeybee antennal lobe. In The Behaviour and Physiology of Bees, L. J. Goodman and R. C. Fischer, eds., pp. 227-241. C.A.B. International, London.
Haberly, L. B., and Bower, J. M. 1989. Olfactory cortex: Model circuit for study of associative memory? TINS 12(7), 133.
Hösl, M. 1990. Pheromone-sensitive neurons in the deutocerebrum of Periplaneta americana: Receptive fields on the antenna. J. Comp. Physiol. A 167, 321-327.
Kaissling, K.-E., and Kramer, E. 1990. Sensory basis of pheromone-mediated orientation in moths. Verh. Dtsch. Zool. Ges. 83, 109-131.
Kauer, J. S. 1974. Response patterns of amphibian olfactory bulb neurons to odor stimulation. J. Physiol. 243, 695-715.
Kennedy, J. S. 1983. Zigzagging and casting as programmed response to windborne odor: A review. Physiol. Entomol. 8, 109-112.
Li, Z., and Hopfield, J. J. 1989. Modeling the olfactory bulb and its neural oscillatory processings. Biol. Cybernet. 61, 379-392.
Lynch, G., and Granger, R. 1989. Simulation and analysis of a simple cortical network. In Computational Models of Learning in Simple Neural Systems, R. D. Hawkins and G. H. Bower, eds., pp. 205-238. Academic Press, New York.
Lynch, G., Granger, R., and Larson, J. 1989. Some possible functions of simple cortical networks suggested by computer modeling. In Neural Models of Plasticity, J. H. Byrne and W. O. Berry, eds., pp. 329-361. Academic Press, New York.
Masson, C., and Mustaparta, H. 1990. Chemical information processing in the olfactory system of insects. Physiol. Rev. 70(1), 199-245.
Meredith, M. 1986. Patterned response to odor in mammalian olfactory bulb: The influence of intensity. J. Neurophysiol. 56(3), 572-597.
Olberg, R. M. 1983. Interneurons sensitive to female pheromone in the deutocerebrum of the male silkworm moth, Bombyx mori. Physiol. Entomol. 8, 419-428.
Rall, W., and Shepherd, G. M. 1968. Theoretical reconstruction of field potentials and dendrodendritic synapse interactions in olfactory bulb. J. Neurophysiol. 31, 884-915.
Wilson, M. A., and Bower, J. M. 1988. A computer simulation of olfactory cortex with functional implications for storage and retrieval of olfactory information. In Neural Information Processing Systems, D. Z. Anderson, ed., pp. 114-126. American Institute of Physics, New York.
Wilson, M. A., and Bower, J. M. 1989. The simulation of large scale neural networks. In Methods in Neuronal Modelling: From Synapses to Networks, C. Koch and I. Segev, eds., pp. 291-334. MIT Press, Cambridge, MA.

Received 6 December 1991; accepted 3 August 1992.
Communicated by Fernando Pineda
Learning Competition and Cooperation

Sungzoon Cho, James A. Reggia
Department of Computer Science, University of Maryland, College Park, MD 20742 USA
Competitive activation mechanisms introduce competitive or inhibitory interactions between units through functional mechanisms instead of inhibitory connections. A unit receives input from another unit in proportion to its own activation as well as to that of the sending unit and the connection strength between the two. This, plus the finite output from any unit, induces competition among units that receive activation from the same unit. Here we present a backpropagation learning rule for use with competitive activation mechanisms and show empirically how this learning rule successfully trains networks to perform an exclusive-OR task and a diagnosis task. In particular, networks trained by this learning rule are found to outperform standard backpropagation networks on novel patterns in the diagnosis problem. The ability of competitive networks to bring about context-sensitive competition and cooperation among a set of units proved to be crucial in diagnosing multiple disorders.

1 Introduction
Competitive activation mechanisms have recently been proposed as a method for producing competitive or inhibitory interactions between units (nodes) through functional mechanisms instead of inhibitory connections (Reggia 1985). A unit sends output to another unit in proportion to the receiving unit's activation as well as the connection weight between them. Since the total output from a unit is finite, competition arises among the receiving units; a stronger receiving unit with a higher activation level gets more input, indirectly reducing the input to weaker (lower activation) units, often until a clear winner(s) emerges. This process has the same effect as inhibiting other competing units without having direct inhibitory connections. This approach brings about more flexible and context-sensitive information processing than more traditional methods: a set of units can

¹A similar idea was independently described in Rumelhart and McClelland (1986).
Neural Computation 5, 242-259 (1993) © 1993 Massachusetts Institute of Technology
compete or cooperate depending on the context (this will be elaborated on later). Applications including print-to-sound transformation, diagnosis, satellite communication scheduling, and models of cerebral cortex have demonstrated that competitive activation mechanisms can work successfully, in many cases with substantially fewer connections than standard approaches (Reggia et al. 1988; Bourret et al. 1989; Peng and Reggia 1990; Reggia et al. 1992; Cho and Reggia 1992). Existing learning methods, however, cannot be used with competitive activation mechanisms because even in feedforward networks there is an implicit recurrent flow of information (to guide the competitive process). This lack of learning methods has greatly limited the usability of competitive activation mechanisms in the past. Thus, we have derived an error backpropagation (EBP) learning rule for networks employing these mechanisms (Cho and Reggia 1991). This new learning rule, referred to as competitive EBP, can be applied to networks with an arbitrary connection structure, but we restrict our attention to simple architectures where one is interested in the final state of a competition, not its process.² In other words, our learning rule could be described as training networks to learn a set of fixed points of a dynamic system, unlike some recurrent learning rules that are concerned with a set of trajectories. This paper describes the competitive EBP learning rule and two applications of it where context-dependent competition and cooperation among units play key roles. The first application involves training networks with three units to learn to perform an exclusive-OR (XOR) operation on their inputs. This application illustrates clearly how competition and cooperation among units can function effectively for this simple, linearly nonseparable problem.
A second, diagnostic application involves localization of damage in the human central nervous system given the findings on a patient's neurological examination. Networks are initially trained to identify the location of single disorders (sites of damage) given a set of manifestations associated with each disorder. Then previously unseen sets of manifestations are presented to the trained networks. It is found that the networks trained with competitive EBP produce better diagnostic hypotheses than networks trained with a standard backpropagation learning rule when multiple disorders are present simultaneously. Solving such multimembership diagnostic problems is widely recognized to be a difficult task (Peng and Reggia 1990). The following section briefly describes competitive activation mechanisms and the competitive EBP rule in an informal manner [refer to Cho and Reggia (1991) for more details]. Simulation results involving the two applications described above are then presented to demonstrate that this approach can work effectively.
²Our learning rule has also proven to be effective in learning continuous-valued functions (Cho and Reggia 1992).
2 Activation Mechanisms and Learning Rule
Given an arbitrarily connected network, let the activation level of unit k at time t, a_k(t), be given as

da_k(t)/dt = -α a_k(t) + β f_k( Σ_{j∈N} out_kj(t) + E_k )    (2.1)

where the output out_kj(t) from unit j to unit k is distributed competitively by

out_kj(t) = γ [ w_kj a_k^p(t) / Σ_{l∈N} w_lj a_l^p(t) ] a_j(t)    (2.2)
The function f_k denotes any differentiable activation function. The weight on the connection from unit j to unit k is denoted by w_kj, which is assumed to be zero when there is no connection between the two units. The term E_k denotes a constant external input to unit k. The network-wide constant parameters α, β, and γ represent decay, input gain, and output gain, respectively. The values of α and β control how fast the activation decays, while that of γ determines how much output a unit sends in terms of its activation level. The parameter p determines how much competition exists among the units: the larger the value of p, the more competitive the model's behavior. The output out_kj(t) is proportional not only to the sender's activation level a_j(t), but also to the receiver's activation level a_k(t). Therefore, a stronger unit receives more activation. Another unit l, which also gets input from unit j, can be seen as competing against unit k for the output from unit j, because the normalizing factor Σ_{l∈N} w_lj a_l^p(t) in the denominator of equation 2.2 constrains the sum of the outputs from unit j to equal its activation level a_j(t) when γ = 1. The activation sent to unit k therefore depends not only on the activations of the units sending it activation, such as unit j, but also on those of its competitors, to which unit k may have no explicit connections. This has the effect of introducing implicit "feedback" (recurrent effects) into networks in locations where explicit connections do not exist. This is why conventional backpropagation and other learning methods are not directly applicable in models where equation 2.2 controls the spread of activation. Applying gradient descent to the usual sum-of-squares error measure (Rumelhart et al. 1986), we derived the following weight updating rule:

Δw_rs = η (γ a_s a_r^p / B_s) [ δ_r − (1/B_s) Σ_{l∈N} w_ls a_l^p δ_l ]    (2.3)
where B_s = Σ_{l∈N} w_ls a_l^p. Since w_ls is zero except for those units l that have connections from unit s (i.e., w_ls ≠ 0), the summation is actually over only
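In code, equation 2.2's competitive distribution of each sender's output can be written down directly. The vectorized sketch below is illustrative (the names and the small `eps` guard against empty denominators are our assumptions, not the authors' implementation); it exhibits the normalization property noted above: with γ = 1, each unit's total output equals its activation level.

```python
import numpy as np

def competitive_outputs(a, w, p=1.0, gamma=1.0, eps=1e-12):
    """out[k, j]: output sent from unit j to unit k (equation 2.2).
    a[k] is the activation of unit k; w[k, j] is the weight from j to k."""
    share = w * (a[:, None] ** p)       # w_kj * a_k^p, competing claims on j
    denom = share.sum(axis=0) + eps     # B_j = sum_l w_lj * a_l^p
    return gamma * (share / denom) * a[None, :]

a = np.array([0.9, 0.1, 0.5])
w = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
out = competitive_outputs(a, w)
print(out.sum(axis=0))   # each sender's output sums to a_j: [0.9, 0.1, 0.5]
```

Note how a receiver's claim on a sender grows with its own activation, which is the positive-feedback mechanism that produces winners and losers.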
those units to which unit s sends its activation. Therefore, learning can be carried out by a local computation. The δ_k values in equation 2.3 are the fixed point solutions of the dynamic system

dδ_k(t)/dt = −δ_k(t) + Σ_{j∈N} d_jk δ_j(t) + a_k Δ_k + β Σ_{m∈N} out_km { δ_k(t) − (1/γ) Σ_{l∈N} out_lm [δ_l(t)/a_m] }    (2.4)

where Δ_k denotes the difference between the desired and the actual values at output unit k (zero at other units). Equation 2.4 appears to be complex but, in fact, is quite simple: all factors other than the δs in equation 2.4 are constants, since this backward dynamic system is run after the forward system (equations 2.1-2.2) reaches equilibrium and the error signals at the output units have been computed.³ When p becomes zero (no competition), equation 2.4 reduces to the same backward dynamic system as derived in recurrent backpropagation (Almeida 1987; Pineda 1987), except for normalized weights. Equation 2.4 can be interpreted intuitively as follows. The first term, Σ_{j∈N} d_jk δ_j(t), represents a_k's partial responsibility for the error signal at unit j, since it influenced the signal to unit j. This term also exists in a standard EBP learning rule (Rumelhart et al. 1986) and in a recurrent EBP learning rule for noncompetitive activation mechanisms (Pineda 1987). The second term, a_k Δ_k, is the error signal arising directly from external teaching signals and is nonzero only for output units. The third term, β Σ_{m∈N} out_km { δ_k(t) − (1/γ) Σ_{l∈N} out_lm [δ_l(t)/a_m] }, occurs only in our competitive EBP learning rule for competitive activation mechanisms. The quantity β Σ_{m∈N} out_km δ_k(t) can be interpreted as accounting for the fact that unit k is indirectly responsible for its own error signal through its influence on the incoming activation signal out_km from unit m (the term out_km has the factor a_k in it). However, unit k is not solely responsible for that error, since the activation level of unit k itself is indirectly influenced by those of its competitors, that is, the other units to which unit m sends its output signals. The sum (1/γ) Σ_{l∈N} out_lm [δ_l(t)/a_m] can be viewed as the amount that needs to be subtracted from unit k to compensate for this indirect influence by competitors.
This term was derived from the denominator term B_m of out_km. Learning proceeds as follows. A training pattern is presented as input to the network, the forward dynamic system (equations 2.1-2.2) is run until it reaches equilibrium, the error signal Δ for each output unit is calculated, the backward dynamic system (equation 2.4) is run to compute the error signals for all the units in the network, and then the weight on each connection is changed according to the learning rule (equation 2.3). These steps are taken for all training patterns, and this is said to comprise one epoch. Learning continues until either all training patterns are correctly learned or a preset time limit of 400 epochs expires. Simulations were conducted using Maryland/MIRRORS II, a simulation software environment that constructs and runs a neural network as specified by the user.

³Benaim and Samuelides proved convergence of feedforward networks employing a slightly different competitive activation mechanism (Benaim and Samuelides 1990). Empirically, we encountered neither oscillations nor divergence.
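The epoch loop just described can be sketched end to end. The sketch below is a simplified stand-in, not the paper's implementation: it fixes α = β = γ = p = 1, uses a sigmoid of the form introduced in Section 3, and replaces the exact rule of equation 2.3 with a plain delta-rule step on the output unit's incoming weights (an assumption made for brevity).

```python
import numpy as np

def sigmoid(x, s=4.0):
    return 1.0 / (1.0 + np.exp(-s * (x - 0.5)))

def total_input(a, w, eps=1e-12):
    """sum_j out_kj for every unit k (equation 2.2 with gamma = p = 1)."""
    share = w * a[:, None]
    return (share / (share.sum(axis=0) + eps) * a[None, :]).sum(axis=1)

def relax_forward(w, E, steps=400, dt=0.1):
    """Integrate equation 2.1 until the activations settle."""
    a = np.full(len(E), 0.01)
    for _ in range(steps):
        a = a + dt * (-a + sigmoid(total_input(a, w) + E))
    return a

def train_epoch(w, patterns, targets, out_unit, eta=0.1):
    """One epoch: relax each pattern, then nudge the output unit's weights."""
    for E, t in zip(patterns, targets):
        a = relax_forward(w, E)
        w[out_unit] += eta * (t - a[out_unit]) * a   # stand-in for equation 2.3
        w[out_unit, out_unit] = 0.0
        np.clip(w, 0.0, None, out=w)                 # keep weights nonnegative
    return w

w = np.array([[0.0, 0.3, 0.0],
              [0.3, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
w = train_epoch(w, [np.array([1.0, 0.0, 0.0])], [1.0], out_unit=2)
```

The key structural point survives the simplification: each weight change requires first relaxing the forward competitive dynamics to a fixed point, because the competition itself is what routes activation.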
3 Exclusive-OR (XOR)
Exclusive-OR (XOR) has been a popular problem for testing learning algorithms due to the historic fact that one-layer linear models and elementary perceptrons are unable to learn the task. Because of its linear nonseparability, a hidden layer of units is required to learn the task with the backpropagation algorithm (Rumelhart et al. 1986). Here we use a competitive activation mechanism and our competitive EBP learning rule to train a network with two input units and an output unit to learn XOR.⁴ Competition and cooperation among a set of units were found to play a crucial role in performing the task. Figure 1 depicts the three-unit network used in the simulation. Units 1 and 2 represent input units and unit 3 represents the output unit. A sigmoid function, f(x) = 1/(1 + e^(−s(x−0.5))), was chosen as f_k(x) in equation 2.1 for all k. With larger values of s, the sigmoid function becomes steeper, producing clearer winners and losers during a simulation. However, large values of s also cause the derivatives of the activation functions to vanish, making error signals much less significant. Such "flat spots" have been recognized previously and various cures suggested (Fahlman 1988; Hinton 1989; Cho and Reggia 1991). Here we start with a small s value of 1 and gradually increase it to 10 (this process is not simulated annealing). The parameters α, β, and γ and the competition parameter p were all set to 1 and fixed throughout learning. Four input patterns (1,0), (0,1), (1,1), and (0,0) were used with the corresponding target outputs 1, 1, 0, and 0, respectively. A total of 45 different nonsolution initial weight states were used, 32 of which were randomly generated and the remaining 13 of which were handcrafted so that the initial weight state was located far away from a solution weight state. In all 45 cases, the competitive EBP learning rule successfully changed the initial nonsolution state into a solution state.
The mean, median, minimum, and maximum number

⁴Our input units are doing more than what ordinary input units do, but since they receive direct external inputs, they are not hidden units.
Figure 1: The three-unit network trained to learn XOR. Weight w_ij represents the weight from unit j to unit i. External inputs E1 and E2 are fed to input units 1 and 2, respectively. The activation level at unit 3 at equilibrium is considered to be the output of the network.
of learning epochs taken were 145, 44, 33, and 840, respectively. The networks with handcrafted initial weights were found to take a much longer time than the networks with randomly generated initial weights, as expected. All of the final solution states obtained share the property that the connections from input to output units are stronger than those between input units (i.e., w21 < w31 and w12 < w32). Such sets of weights perform XOR with a competitive activation mechanism as follows. When the pattern (1,0) is applied to the input units, input unit 2 and output unit 3 compete for the output from input unit 1. Since w31 > w21, unit 1 sends a larger amount of activation to unit 3 than to unit 2. With a larger input from unit 1, the activation level of unit 3 surpasses that of unit 2. Then, with the help of a slightly larger activation level, unit 3 gets an even larger amount of input from unit 1 than does unit 2, since the input amount to unit 3 is proportional not only to the connection weight but also to the activation level of unit 3. This process of increasing the activation level of unit 3 accelerates while unit 2 remains close to where it started, that is, near zero. The network finally reaches equilibrium with unit 3 as the winner and unit 2 as the loser, which means that the activation levels of unit 1 and unit 3 are close to one and that of unit 2 is close to zero. This is exactly what is desired. By symmetry, unit 3 wins with the input pattern (0,1), since w32 > w12. With the pattern (1,1), initially two competitions occur at the same time: between unit 1 and unit 3 for
unit 2's activation, and between unit 2 and unit 3 for unit 1's activation. With both unit 1 and unit 2 receiving external inputs and feeding each other (cooperation), they both win over unit 3, overcoming the fact that w31 > w21 and w32 > w12.⁵ Finally, with pattern (0,0), unit 3 does not get any significant input from either of the input units, so it stays at its initial activation level, which is zero. Through competition and cooperation among units, a competitive activation mechanism implements the XOR task.

4 Diagnostic Associations
Recently, backpropagation models have been applied to diagnostic problem solving with some success. However, these backpropagation models for diagnosis apply to small, circumscribed decision problems where it is usually assumed that at most a single disorder is present. For example, the system for diagnosing myocardial infarction was limited to determining the presence or absence of that single disorder (Baxt 1990). The previous backpropagation models thus typically perform a pattern classification task (selection of one diagnostic category out of several when given a problem description), and do not in any sense solve general diagnostic problems where one must construct a multiple disorder solution from individual disorders. This is a major limitation because in many diagnostic problems multiple disorders may be present simultaneously (referred to here as multiple disorder patterns) (Peng and Reggia 1990). Further, a diagnostician is often presented with manifestations that are a proper subset of the manifestations associated with a disorder (referred to here as partial manifestation patterns). What would be desirable is to train a network with "textbook examples" consisting of all manifestations associated with each individual disorder (referred to here as prototypical patterns), yet for the trained network to produce reasonably good diagnoses with nonprototypical multiple disorder patterns and partial manifestation patterns. Connectionist models based on backpropagation have not previously been demonstrated to perform well for multiple disorder problems. Our hypothesis was that networks trained by the competitive backpropagation learning rule would produce better diagnoses given multiple disorder patterns and partial manifestation patterns, since they do not employ inhibitory connections.
To assess this, we trained networks using both standard EBP (referred to as standard networks) and competitive EBP (referred to as competitive networks), and then compared their respective performance with previously unseen multiple disorder patterns and partial manifestation patterns.
5 This process does not work when w31 >> w21 and w32 >> w12.
Learning Competition and Cooperation
249
4.1 The Localization Task and Training Networks. The medical information used for training is based on standard reference sources and personal knowledge of one of the authors (JR) (Adams and Victor 1985; Brazis et al. 1985). We selected 29 different manifestations (Table 1) and 16 localizations of brain damage, so 16 associations between a disorder and a set of prototypical manifestations had to be learned (Table 2). Each disorder represents damage to a specific part of the brainstem, cerebellar hemispheres, or cerebral hemispheres. With each manifestation and disorder assigned a numbered index, these associations can be viewed as 16 binary 1/0 patterns (Table 3). Presence of a 1 in the ith row and jth column indicates a causal relation between the ith manifestation and the jth disorder. For all 16 patterns, during training only one output unit was to be turned on, representing the presence of the disorder whose prototypical manifestations were present. These prototypical patterns, created as described above, were found to be linearly separable since the 16 input patterns are linearly independent when viewed as binary vectors (Hertz et al. 1991). Because these prototypical patterns were linearly separable, we used networks with two layers of units, namely, input and output units only. However, this is not an easy problem to solve: we are not concerned here with just classifying input patterns into one of n categories. We are interested in the far more difficult task of identifying a set of categories when a network has only been trained with single-disorder exemplars. Thus, for example, with diagnosis involving n disorders the cardinality of the output space is 2^n, not n. This kind of multimembership problem is widely recognized to
Table 1: Manifestations (Input Units)

No.     Manifestation
1, 2    Left, right hemiparesis
3, 4    Left, right facial paresis
5, 6    Left, right tongue paresis
7, 8    Left, right gaze palsy (conjugate)
9, 10   Left, right internuclear ophthalmoplegia
11, 12  Left, right 3rd nerve palsy
13, 14  Left, right 6th nerve palsy
15, 16  Left, right Horner's syndrome
17      Nystagmus
18, 19  Left, right hemiataxia
20, 21  Left, right touch-proprioception impairment
22, 23  Left, right pain-temperature impairment
24, 25  Left, right facial sensory impairment
26, 27  Left, right hemianopsia
28, 29  Sensory, motor aphasia
Sungzoon Cho and James A. Reggia
250
Table 2: Disorders (Output Units) and Their Corresponding Manifestations

No.  Disorder                     Manifestations (see Table 1)
1    Left medial medulla          2, 5, 21
2    Right medial medulla         1, 6, 20
3    Left lateral medulla         15, 17, 18, 20, 23, 24
4    Right lateral medulla        16, 17, 19, 21, 22, 25
5    Left medial pons             2, 4, 6, 7, 9, 13, 17, 18, 21
6    Right medial pons            1, 3, 5, 8, 10, 14, 17, 19, 20
7    Left lateral pons            3, 7, 15, 17, 18, 21, 23, 24
8    Right lateral pons           4, 8, 16, 17, 19, 20, 22, 25
9    Left midbrain                2, 4, 6, 11, 17, 18, 21, 23
10   Right midbrain               1, 3, 5, 12, 17, 19, 20, 22
11   Left cerebellum              17, 18
12   Right cerebellum             17, 19
13   Left frontal lobe            2, 4, 6, 8, 29
14   Right frontal lobe           1, 3, 5, 7
15   Left parietotemporal lobe    21, 23, 25, 27, 28
16   Right parietotemporal lobe   20, 22, 24, 26
be a very challenging problem in statistical pattern recognition and diagnostic problem solving (Peng and Reggia 1990). There is no generally recognized ideal solution to such problems at present, and any approach (neural network or otherwise) that can approximate solutions is worth investigating. The standard network we used is shown in Figure 2a and the competitive network is shown in Figure 2b. The standard network used a single bias unit that was always on. The multiple "bias units" in the competitive network were an experiment: they were introduced originally to develop as feature detectors. In other words, they were expected to selectively turn on for certain features in the input manifestation patterns.

4.2 Learning Prototypical Patterns (Single-Disorder Patterns). Both standard and competitive networks learned the prototypical patterns rather easily.6 First, we successfully trained standard networks with zero, four, and eight hidden units in around 90, 300, and 210 epochs, respectively. For each input pattern of a set of manifestations, the correct

6 Parameters α, β, γ, and ρ in equations 2.1 and 2.2 were set to 1.0. Parameter s increased linearly from 3 to 8 over 200 epochs and then remained fixed. A constant 0.1 was added to the derivative of the activation function to accelerate learning (Fahlman 1988). For both the forward and backward dynamic systems, equilibrium was assumed if every dynamic variable changed by less than 0.01.
Table 3: Manifestation-Disorder Associations (I/O patterns). [A 29 x 16 binary matrix: entry (i, j) = 1 if and only if manifestation i (rows, as in Table 1) is caused by disorder j (columns, as in Table 2).]
disorder unit became strongly activated while the others did not. Since the standard networks with no hidden units gave the best generalization performance, we describe results only from the network with no hidden units (shown in Fig. 2a). We also trained 35 different competitive networks. Five different numbers of bias units were used, ranging from zero to four. For each case, seven different competitive networks were generated by randomly assigning initial weights. Any competitive network with at least one bias unit learned the training patterns successfully in between 21 and 45 epochs, with an average of 36 epochs.7 The bias units did not learn to become feature detectors, but functioned in a fashion equivalent to a single bias unit that always came on. Furthermore, the competitive networks with exactly one bias unit produced the best diagnoses for multiple disorder patterns and partial manifestation patterns. Thus, we

7 Although competitive networks took fewer epochs in training than standard networks, each epoch with a competitive network takes more real time due to the recurrent nature of competitive activation mechanisms.
Figure 2: (a) Standard network and (b) competitive network. The standard network shown in (a) consists of 29 input units representing the manifestations listed in Table 1, 16 output units representing the disorders listed in Table 2, and a bias unit. Each input unit and the bias unit is connected to every output unit. Standard networks with a hidden layer of four and eight units were also trained, but did not give better performance in either training or testing and so are not considered further. Initially, weights were randomly chosen from a uniform distribution in the range [-0.3, 0.3]. The competitive network shown in (b) has the same set of input and output units and connections as the standard network, except that it has a set of zero to four bias units receiving connections from input units (see text for explanation). A total of 35 competitive networks (seven for each number of bias units) were randomly generated with initial weights picked from a uniform distribution in the range [0, 1]. Input units are fully connected to bias units, which are in turn fully connected to output units. Both incoming and outgoing connections of the bias units were updated according to the learning rule.
present only results from one of the competitive networks with one bias unit. The weight values of the trained standard network with no hidden unit (but one bias unit) and those of the trained competitive network with one bias unit are shown in Figure 3a and b, respectively. Note that both networks developed a similar but not identical pattern of excitatory connections, but the competitive network does not have any inhibitory connections. If the trained networks were pruned, that is, if connections with very small valued weights were removed from the networks, the competitive network would become much smaller than the standard network. As we explain below, examination of these weight values provides insight as to why competitive networks produce a better diagnosis than standard networks with multiple disorder patterns and partial manifestation patterns.
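The insight above, that diagnosis in competitive networks is decided by how each manifestation's output is divided among competing disorder units rather than by net excitation minus inhibition, can be sketched with a minimal competitive-distribution step. This is an illustration only, not the paper's exact equations 2.1 and 2.2; all names are hypothetical.

```python
import numpy as np

# Minimal competitive-distribution step (illustrative, not the paper's
# exact equations): with only excitatory weights, each sender divides
# its output among receivers in proportion to receiver activation
# times connection weight, so allocation depends on relative input.
def competitive_input(a_recv, w, a_send, eps=1e-9):
    # w[k, j]: excitatory weight from sender j to receiver k
    bids = w * a_recv[:, None]                # each receiver's "bid" per sender
    shares = bids / (bids.sum(axis=0) + eps)  # normalize over receivers
    return shares @ a_send                    # input captured by each receiver

a_recv = np.array([0.7, 0.3])   # receiver 0 is currently more active
w = np.ones((2, 1))             # equal weights from a single sender
net = competitive_input(a_recv, w, np.array([1.0]))

# The more active receiver captures the larger share (competition),
# without any inhibitory connection suppressing the loser outright.
assert net[0] > net[1] and np.isclose(net.sum(), 1.0)
```

Because the sender's total output is conserved, two cooperating receivers can jointly outcompete a third without inhibitory weights, which is the behavior the weight plots in Figure 3b reflect.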
Figure 3: Connection weights of trained (a) standard and (b) competitive networks. Each box represents a connection from an input to an output unit. The size of a circle in the ith row and jth column is proportional to the size of the connection weight from input unit j - 1 to output unit i with the bias unit denoted as unit 0 (leftmost column). The filled circles represent positive weight values and the open circles represent negative weight values. Note that the trained standard network has a large number of inhibitory weights, and that the bias unit in the standard network inhibits output nodes while that in the competitive network excites them.
4.3 Testing Multiple Disorder Patterns. For multiple disorder patterns, the combined manifestations of two disorders were presented simultaneously to the trained networks. A total of 48 multiple disorder
input pattern pairs were selected for testing (with Pi denoting the pattern of input manifestations associated with disorder i: P16 plus each of the preceding 15 patterns, P14 plus each of the preceding 13 patterns, P12 plus each of the preceding 11 patterns, and P10 plus each of the preceding 9 patterns were presented). It turns out that 4 of these 48 multiple disorder patterns are the same as one of the two combined original input patterns, so there are only 44 truly multiple disorder patterns. The trained networks would be performing ideally if they produced two clear winning disorder units corresponding to those disorders whose manifestations were presented together.8 The standard network activated the two corresponding disorder units in only 16 cases, and only one disorder unit in the remaining 28 cases. In addition, most of the winners in the total of 44 cases were not fully activated (see Table 4). This weak activation in the output layer of the standard network, representing the failure to generate two clear correct winners, is attributed to the large number of inhibitory connections that are present (see negative weights in Fig. 3a). When manifestations are turned on that are associated with more than one disorder, the corresponding disorder units receive strong inhibitory input as well as excitatory input. Consider, for instance, the case of disorders 1 and 16. Disorder 1 is associated with manifestations 2, 5, and 21, while disorder 16 is associated with manifestations 20, 22, 24, and 26. When all seven of these manifestations were presented, however, neither disorder 1 nor disorder 16 turned on strongly (first row of Table 4). This is because disorder 1 has inhibitory connections from manifestations 20, 22, 24, and 26, all of which are associated with disorder 16 (see Fig. 3a), thus sending inhibition to disorder 1.
Similarly, manifestations 2, 5, and 21, which are associated with disorder 1, send inhibitory activation to disorder 16, thus resulting in a weak activation for disorder 16 (see Fig. 3a). Manifestation units not only send excitatory signals to the associated disorder units, but also send inhibitory signals to the disorder units that are not associated with them. This prevents the standard network from producing clear-cut multiple winners when multiple disorders are present. The competitive network, on the other hand, turned on both of the two associated disorder units very strongly in 38 out of the 44 cases, clearly outperforming the standard network (Table 5). Use of only excitatory connections coupled with the competitive activation mechanism enabled the network to produce multiple winners in the disorder layer
8 The actual situation is more complex than this, as it is possible that the union of manifestations for two disorders could correspond to the manifestations of another single disorder.
Table 4: Multiple-Disorder Testing with Standard Network. [Activation level at each of the 16 output units for each of the 44 two-disorder test patterns (P16, P14, P12, and P10 each paired with preceding patterns); winning units typically reached only 0.2-0.9 activation. Entries less than 0.2 not shown.]
when the two winners are necessary to account for the input manifestations (see Fig. 3b). Although there are some additional disorder units
Table 5: Multiple-Disorder Testing with Competitive Network. [Activation level at each of the 16 output units for the same 44 two-disorder test patterns; the associated disorder units were fully activated (1) in most cases. Entries less than 0.2 not shown.]
turned on, those "nonperfect" diagnoses are not necessarily undesirable in a clinical sense. For instance, when P3 and P14 are presented, not only
disorders 3 and 14, but also disorder 7 is activated. This is a reasonable diagnosis given the fact that seven out of the eight manifestations associated with disorder 7 are present (Peng and Reggia 1990).

4.4 Testing Partial Manifestation Patterns. A total of 29 partial manifestation patterns, each involving only one input unit (manifestation) being activated, were presented to the standard network as well as to the competitive network. The majority (19) of manifestations are associated with more than one disorder, while the rest (10) are associated with only one disorder (manifestations 9 through 14 and 26 through 29). In the majority of cases, both the standard and the competitive network activated multiple disorder units partially, producing no clear winners but a set of alternatives. These results are similar to a "differential diagnosis" that a human diagnostician might give. It is in the remaining 10 cases, where the input manifestation can be caused by only a single disorder, that the two types of networks performed differently. The standard network did not produce a clear winner in any of these 10 cases; most of the disorder units were activated rather weakly. Input from a single manifestation unit was not enough to fully activate the disorder units.9 The competitive network, however, turned on the respective corresponding disorder units as clear winners, since it is the relative amount of input to the disorder units, not the absolute amount, that determines the result of competition. Note that even with a small difference in the input, competitive networks can produce clear winners and losers.
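The two kinds of test patterns, and the relative-input principle just described, can be sketched as follows. The 16 prototype manifestation vectors here are random stand-ins for Table 3's columns, and the numeric values are illustrative.

```python
import numpy as np

# Sketch of the test-pattern construction and the relative-input
# principle; the prototype vectors are random stand-ins for Table 3.
rng = np.random.default_rng(2)
p = rng.integers(0, 2, size=(16, 29))

# Multiple-disorder pattern: element-wise OR (union) of two prototypes.
multi = np.maximum(p[0], p[15])
assert multi.sum() >= max(p[0].sum(), p[15].sum())

# Partial manifestation pattern: a single manifestation unit turned on.
partial = np.zeros(29, dtype=int)
partial[8] = 1
assert partial.sum() == 1

# Competition decides winners by relative, not absolute, input: a weak
# absolute signal still yields a clear winner after normalization.
raw = np.array([0.3, 0.1, 0.1])
relative = raw / raw.sum()
assert relative.argmax() == 0 and relative[0] > 0.5
```

This is why a lone manifestation unit, too weak to drive a standard network's output units to threshold, can still produce a decisive winner under competition.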
5 Conclusion
Competitive activation mechanisms are an alternative way of controlling the spread of activation in neural networks. They have been shown to be effective in some applications, but until now lacked an effective supervised learning method. We therefore have derived a backpropagation learning rule for use with competitive activation mechanisms and have described it in this paper. This learning rule can be viewed as a generalization of a previous form of recurrent backpropagation with normalized weights (Pineda 1987). To demonstrate that our new learning rule can work effectively, we first applied it to the task of exclusive-OR. Competitive networks with three units were successfully trained to perform this operation on their

9 This was studied further by independently training and testing standard networks with normalized input patterns. Such networks produced strong winners in the corresponding single disorder units. However, when the single manifestation presented was associated with more than one disorder, the standard networks trained with normalized patterns produced strong winners in all the associated disorder units, which is an unacceptable result. Using normalized input patterns also produced poor results with multiple-disorder patterns.
inputs. Having hidden units was unnecessary because the two input units and one output unit could compete to produce correct behavior by the network. The second application involved locating areas of brain damage given manifestations (neurological signs and symptoms) as input. It should be noted that a diagnosis problem like this is not just a pattern classification problem where the given set of manifestations is classified into one of several categories. Rather, it also involves constructing a hypothesis that includes more than one disorder. Standard backpropagation networks have not been demonstrated to handle multiple-disorder diagnosis problems effectively. The standard backpropagation network we studied developed many inhibitory connections when trained with just prototypical cases. When manifestations of multiple disorders were subsequently presented simultaneously, these inhibitory connections prevented the network from producing appropriate diagnostic hypotheses. Competitive networks, on the other hand, performed qualitatively better in such cases. In competitive networks, the disorder units were able to "cooperate" in producing multiple winners when appropriate, even though they were only trained on prototypical single-disorder cases. Competitive networks also did better than standard networks with partial manifestation patterns. In summary, we have derived and demonstrated the effectiveness of a supervised learning rule for competitive activation mechanisms. Together with an unsupervised learning rule previously developed for competitive activation mechanisms (Sutton et al. 1990), this greatly increases the range of tasks to which this approach can be applied.

Acknowledgments
This work was supported by NIH awards NS29414 and NS16332. Dr. Reggia is also with the Institute for Advanced Computer Studies at the University of Maryland.

References

Adams, R., and Victor, M. 1985. Principles of Neurology. McGraw-Hill, New York.
Almeida, L. 1987. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. Proceedings of the IEEE First Annual International Conference on Neural Networks, Vol. II, pp. 609-618, San Diego, CA.
Baxt, W. 1990. Use of an artificial neural network for data analysis in clinical decision making. Neural Comp. 2, 480-489.
Benaim, M., and Samuelides, M. 1990. Dynamical properties of neural nets using competitive activation mechanisms. Proceedings of International Joint Conference on Neural Networks, Vol. III, pp. 541-546, San Diego, CA.
Bourret, P., Goodall, S., and Samuelides, M. 1989. Optimal scheduling by competitive activation: Application to the satellite antennae scheduling problem. Proceedings of International Joint Conference on Neural Networks, Vol. I, pp. 565-572, Washington, DC.
Brazis, P., Masdeu, J., and Biller, J. 1985. Localization in Clinical Neurology. Little, Brown, Boston.
Cho, S., and Reggia, J. 1991. A recurrent error back-propagation rule for competitive activation mechanisms. Tech. Rep. CS-TR-2661, Department of Computer Science, University of Maryland.
Cho, S., and Reggia, J. 1992. Learning visual coordinate transformations with competition. Proceedings of International Joint Conference on Neural Networks, Baltimore, MD, Vol. IV, pp. 49-54.
Fahlman, S. 1988. Faster-learning variations on back-propagation: An empirical study. Proceedings of the 1988 Connectionist Models Summer School, pp. 38-45, Pittsburgh, PA.
Hertz, J., Krogh, A., and Palmer, R. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, CA.
Hinton, G. 1989. Connectionist learning procedures. Artificial Intelligence 40, 185-234.
Peng, Y., and Reggia, J. 1990. Abductive Inference Models for Diagnostic Problem-Solving. Springer-Verlag, Berlin.
Pineda, F. 1987. Generalization of back-propagation to recurrent neural networks. Phys. Rev. Lett. 59(19), 2229-2232.
Reggia, J. 1985. Virtual lateral inhibition in parallel activation models of associative memory. Proceedings of the 9th International Joint Conference on Artificial Intelligence, Vol. 1, pp. 244-248, Los Angeles, CA.
Reggia, J., D'Autrechy, C. L., Sutton, G., and Weinrich, M. 1992. A competitive distribution theory of neocortical dynamics. Neural Comp. 4, 287-317.
Reggia, J., Marsland, P., and Berndt, R. 1988. Competitive dynamics in a dual-route connectionist model of print-to-sound transformation. Complex Syst. 2, 509-547.
Rumelhart, D., Hinton, G., and Williams, R. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.
Rumelhart, D., and McClelland, J. 1986. On learning the past tenses of English verbs. In Parallel Distributed Processing, Vol. 2, D. Rumelhart, J. McClelland, and the PDP Research Group, eds., pp. 216-271. MIT Press, Cambridge, MA.
Sutton, G., Reggia, J., and Maisog, J. 1990. Competitive learning using competitive activation rules. Proceedings of International Joint Conference on Neural Networks, Vol. II, pp. 285-291, San Diego, CA.

Received 16 May 1991; accepted 26 June 1992.
Communicated by Christof Koch
Constraints on Synchronizing Oscillator Networks

David E. Cairns, Roland J. Baddeley, Leslie S. Smith
Centre for Cognitive and Computational Neuroscience, University of Stirling, Stirling, Scotland, FK9 4LA
This paper investigates the constraints placed on some synchronized oscillator models by their underlying dynamics. Phase response graphs are used to determine the phase locking behaviors of three oscillator models. These results are compared with idealized phase response graphs for single phase and multiple phase systems. We find that all three oscillators studied are best suited to operate in a single phase system, and that the requirements placed on oscillatory models for operation in a multiple phase system are not compatible with the underlying dynamics of oscillatory behavior for these types of oscillator models.
1 Introduction

Following observations of oscillations and synchronization behavior in cat visual cortex (Eckhorn et al. 1989; Gray et al. 1989a), a number of interpretations have been put forward to explain these results (Gray et al. 1989b; Eckhorn et al. 1988; Grossberg and Somers 1991; Shastri 1989; Sompolinsky et al. 1990). It has been suggested that a possible interpretation of the observed synchronization behavior is that the brain could be using synchronized oscillations as a method of solving the binding problem (von der Malsburg and Schneider 1986). If a cluster of nodes that share a common property are synchronized, they are thus labeled as belonging to one group. Other synchronized nodes that are in a different phase of an oscillatory cycle are effectively labeled as a separate group. By using this method, a number of different entities may be stored simultaneously, each represented by a different phase in an oscillatory cycle. A fundamental requirement behind these theories is that groups of nodes should be able to move into and remain in separate synchronized phases.

Neural Computation 5, 260-266 (1993) © 1993 Massachusetts Institute of Technology

A simple but effective architecture that enables synchronization to take place is lateral coupling. Lateral connections between node pairs
transfer a measure of the activation state of one node to the other. This causes a change in the period of the receiving node and therefore a change in its phase. We investigate the response of three generic oscillator models to this type of effect and determine whether or not they are capable of supporting multiple phases, as required by the above theories in order to perform binding.

2 Method
Three studies were performed, one for each of the oscillator models. To provide a small but general cross section, we chose one simple oscillator model and two biological models. For the simple oscillator, a leaky integrator model was chosen to illustrate the most basic phase response one can obtain from a nonlinear system (Appendix A.1). As an example of models of cellular oscillations or potential pacemaker cells, a reduced version of the Hodgkin-Huxley cell membrane model (the Morris-Lecar oscillator) was chosen [Appendix A.2 (Rinzel and Ermentrout 1989)]. At the multicellular level, an oscillator based on an original model of excitatory/inhibitory cell cluster interactions by Wilson and Cowan (1972) was used [Appendix A.3 (Wang et al. 1990)]. The technique for obtaining the phase response graphs was adapted from Rinzel and Ermentrout's original study of the dynamics of the Morris-Lecar model (Rinzel and Ermentrout 1989). Each oscillator was driven by a constant input until the period of the oscillation had stabilized. This gave a base period Λb. Driving the oscillator by the constant input and starting from a point just after peak activation, trials were made for points across the phase of the oscillation. For each successive trial, the instant of delivery of an entraining signal to the oscillator was increased. Each entraining input was of a constant size and was delivered for a constant proportion (0.025) of the period of the oscillator. The entraining input caused a change in the period of the oscillator. A measure of the relative phase shift caused by the entraining input was calculated according to equation 2.1:

ΔΘ = (Λb - Λn)/Λb    (2.1)

where ΔΘ is the phase shift, Λb is the normal period, and Λn is the new period. The cumulative results of the trials allowed the production of phase response graphs for each oscillator.

3 Discussion
The following discussion relates how the results of the study (shown in Fig. 1) compare with an idealized phase response behavior that one
would like in a single phase and a multiple phase system. For a single phase system, where all nodes move toward a globally synchronized activation, the ideal phase response behavior can be represented by the graph in Figure 2a. An entraining signal causes the phase of a receiving node to move in the direction of the phase of the node producing the signal. The degree of phase shift is proportional to the difference between the two nodes and thus causes a steady convergence with minimal possibility of overshoot. The direction of phase shift is determined by the difference in phase, the phase shift being in the direction of the shortest "route" to synchrony. A node with this form of behavior will always attempt to synchronize with the originator of any signal and will remain unperturbed only when in synchrony. This represents an idealized behavior for a single phase system; however, any system that has zero phase shift at 0 and 1, with a monotonic decrease in phase in the region 0-0.5 and a monotonic increase in phase in the region 0.5-1.0 (with a discontinuity at 0.5), will cause synchronization to occur (Niebur et al. 1991; Sompolinsky et al. 1990). An example of the phase response for an oscillator in a multiple phase system is shown in Figure 2b. The oscillator maintains the requirements for phase locking with entraining signals arriving close to the phase of a receiving node. However, if the entraining signals arrive further out of phase, then no phase shift occurs. This "dead zone" allows for the coexistence of multiple phase groups, where inputs arriving from each out-of-phase group do not perturb the receiving group. The above phase response behavior is atypical of most oscillatory dynamics. The frequency of a node is usually increased or decreased as a result of extra input. Only in cases where the "activation" of the node is saturated (for example, when it has reached its peak or is in a refractory period) will little phase shift occur.
The extended region of low response required for multiple phases is unlikely to be present in the basic dynamics of most oscillator models. Comparing these requirements with the phase graphs of the oscillators under study, it can be seen that they are best suited to single phase/synchronized activation systems. All three models exhibit an almost linear positive phase convergence in the latter half of their phase. In the case of the two neurophysiologically based systems, some negative movement is also observed in the first half of the phase plane. Although none of the phase responses is ideal, they are sufficient to allow all of the models to exhibit effective synchronization behavior. Conversely, the oscillators studied do not show the type of behavior necessary for a multiple phase system. They do not possess significant regions of low response to entraining input in their mid-phase region. Consequently, these oscillators do not allow for separate phases to coexist stably in a system. They will always be attempting to cause global synchronization. This would favor a network where one population is synchronized against a background of incoherent activity (Sompolinsky et al. 1990; Niebur et al.
1991; Koch and Schuster 1992), thus allowing figure-ground separation but not labeling of multiple objects by phase (Shastri 1989).

Figure 1: Phase response graphs. (a) Leaky integrator model, (b) Morris-Lecar cell membrane model, and (c) Wang et al. cell cluster model. Each graph shows the change in phase that occurs when an oscillator is perturbed at a given point in its phase. The x axis gives the phase of the oscillator at the point it is perturbed, and the y axis the degree of perturbation in terms of phase shift. The amount of entraining input I by which the oscillator is stimulated is given as a fraction of the driving input of the oscillator.

4 Conclusion
This paper indicates there is a limit to the number of stable phases one can expect a system of interacting oscillators to maintain and that this limit is low. The results give support to models that use similar oscillators to achieve low level synchronization for the purposes of coherent activation. For models that use synchronized oscillations and multiple phases
as a method to solve the binding problem, they show that the number of phases available is likely to be significantly less than the minimum required to perform useful computation.

Figure 2: Idealized phase response graphs. (a) Single phase system; (b) multiple phase system.
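The multiple phase requirement of Figure 2b can be sketched by adding a mid-phase dead zone to an idealized phase response (all values below are illustrative): two groups half a cycle apart then coexist instead of merging.

```python
# Sketch of the multiple-phase response of Figure 2b (illustrative
# values): phase locking near synchrony, but a mid-phase "dead zone"
# in which out-of-phase input causes no shift.
def prc_dead(delta, eps=0.1, zone=(0.2, 0.8)):
    d = delta % 1.0
    if zone[0] < d < zone[1]:
        return 0.0                       # dead zone: ignore out-of-phase input
    return -eps * d if d <= zone[0] else eps * (1.0 - d)

phases = [0.00, 0.03, 0.50, 0.53]        # two groups, half a cycle apart
for _ in range(200):
    for i in range(4):
        for j in range(4):
            if i != j:
                phases[i] = (phases[i] + prc_dead(phases[i] - phases[j])) % 1.0

# Each group synchronizes internally while the two groups remain about
# half a cycle apart, rather than collapsing to a single phase.
within = min((phases[0] - phases[1]) % 1.0, (phases[1] - phases[0]) % 1.0)
between = min((phases[0] - phases[2]) % 1.0, (phases[2] - phases[0]) % 1.0)
assert within < 1e-6 and 0.4 < between < 0.6
```

The point of the paper is that this flat mid-phase region is precisely what the intrinsic dynamics of the three oscillators studied do not provide.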
Appendix A Oscillator Models

A.1 Leaky Integrator. T is the tonic input (1.0), k = 0.95, E is the entraining input (0.5, 0.25), Θ = 19.93.
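The phase-response procedure of Section 2 can be sketched with a threshold-and-reset leaky integrator of this general type. The parameter values and the exact integrator form below are illustrative assumptions, not the paper's.

```python
import numpy as np

# Sketch of the phase-response measurement in Section 2, using an
# illustrative threshold-and-reset leaky integrator (parameter values
# and integrator form are assumptions, not the paper's).
def period(k=0.05, tonic=1.0, theta=10.0, dt=1e-3,
           pulse_start=None, pulse_len=0.0, pulse_amp=0.0):
    """Time for activation a (da/dt = -k*a + input) to reach theta."""
    a, t = 0.0, 0.0
    while a < theta:
        drive = tonic
        if pulse_start is not None and pulse_start <= t < pulse_start + pulse_len:
            drive += pulse_amp
        a += dt * (-k * a + drive)
        t += dt
    return t

base = period()                                  # base period (no pulse)
prc = []
for phase in np.linspace(0.05, 0.9, 10):         # pulse at successive phases
    new = period(pulse_start=phase * base,       # new period with entrainment
                 pulse_len=0.025 * base, pulse_amp=0.5)
    prc.append((base - new) / base)              # equation 2.1

# An excitatory pulse always advances this oscillator, and more so the
# later in the cycle it arrives (the qualitative shape of Figure 1a).
assert all(d >= 0 for d in prc) and prc[-1] > prc[0]
```

Late pulses are more effective because charge injected early partially leaks away before threshold is reached, which is why the sketch reproduces the rising phase response of Figure 1a.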
A.2 Morris-Lecar.

dv/dt = -i_ion(v, w) + T + E    (A.2)

where
v is the voltage
w is the fraction of K+ channels open
g_Ca = 1.1, g_K = 2.0, g_L = 0.5
v1 = -0.01, v2 = 0.15, v3 = 0.0, v4 = 0.3
V_K = -0.7, V_L = -0.5
φ = 0.2
T is the tonic input (0.28)
E is the entraining input (0.14, 0.07)
A.3 Wang et al.
dyi-- --Yi + G, dt
T,
F ( x ) = (1 - V ) X T~ = 0.9
6, = 0.4
+ Tyx;)
(-T
yyY
+ ~2
(0 I Q I 1)
T,, = 1.0
1.0 Oy = 0.6 TW = 1.9 f = 0.2 A, = 0.05 Tyx= 1.3 j = 0.2 A, = 0.05 Tw = 1.2 Q = 0.4 a! = 0.2 /3 = 0.14
T~ =
(A.9)
(A.12)
I; is the tonic input (0.3) Siis the entraining input (0.15,0.075)
Acknowledgments

The authors would like to thank the members of CCCN for useful discussions in the preparation of this paper, in particular Peter Hancock and Mike Roberts for their helpful comments on the draft versions. Roland Baddeley and David Cairns are both funded by SERC, and Leslie Smith is a member of staff in the Department of Computing Science and Mathematics at the University of Stirling.

References

Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybernet. 60, 121-130.

Eckhorn, R., Reitboeck, H. J., Arndt, M., and Dicke, P. 1989. Feature linking via stimulus-evoked oscillations: Experimental results from cat visual cortex and functional implications from a network model. Proc. Intl. Joint Conf. Neural Networks (Washington), pp. 723-730.

Gray, C. M., Konig, P., Engel, A. K., and Singer, W. 1989a. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronisation which reflects global stimulus properties. Nature (London) 338, 1698-1702.

Gray, C. M., Konig, P., Engel, A. K., and Singer, W. 1989b. Synchronisation of oscillatory responses in visual cortex: A plausible mechanism for scene segmentation. Proc. Intl. Symp. Synergetics Cognition, Vol. 45, C24, pp. 82-98.

Grossberg, S., and Somers, D. 1991. Synchronised oscillations during cooperative feature linking in a cortical model of visual perception. Neural Networks 4, 453-466.

Koch, C., and Schuster, H. 1992. A simple network showing burst synchronization without frequency locking. Neural Comp. 4(2), 211-223.

Niebur, E., Schuster, H. G., Kammen, D. M., and Koch, C. 1991. Oscillator-phase coupling for different two-dimensional network connectivities. Phys. Rev. A 44, 6895-6904.

Rinzel, J., and Ermentrout, G. B. 1989. Analysis of neural excitability and oscillations. In Methods in Neuronal Modeling: From Synapses to Networks, C. Koch and I. Segev, eds. MIT Press, Cambridge.

Shastri, L. 1989. From Simple Associations to Systematic Reasoning: A Connectionist Representation of Rules, Variables and Dynamic Bindings. Tech. Rep., University of Pennsylvania.

Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1990. Global processing of visual stimuli in a neural network of coupled oscillators. Proc. Natl. Acad. Sci. U.S.A. 87, 7200-7204.

von der Malsburg, C., and Schneider, W. 1986. A neural cocktail-party processor. Biol. Cybernet. 54, 29-40.

Wang, D., Buhmann, J., and von der Malsburg, C. 1990. Pattern segmentation in associative memory. Neural Comp. 2, 94-106.

Wilson, H. R., and Cowan, J. D. 1972. Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J. 12, 1-24.

Received 19 May 1992; accepted 3 September 1992.
Communicated by Ralph Linsker
Learning Mixture Models of Spatial Coherence Suzanna Becker Geoffrey E. Hinton Department of Computer Science, University of Toronto, Toronto, Ontario, Canada M5S 1A4
We have previously described an unsupervised learning procedure that discovers spatially coherent properties of the world by maximizing the information that parameters extracted from different parts of the sensory input convey about some common underlying cause. When given random dot stereograms of curved surfaces, this procedure learns to extract surface depth, because that is the property that is coherent across space. It also learns how to interpolate the depth at one location from the depths at nearby locations (Becker and Hinton 1992b). In this paper, we propose two new models that handle surfaces with discontinuities. The first model attempts to detect cases of discontinuities and reject them. The second model develops a mixture of expert interpolators. It learns to detect the locations of discontinuities and to invoke specialized, asymmetric interpolators that do not cross the discontinuities.

1 Introduction
Standard backpropagation is implausible as a model of perceptual learning because it requires an external teacher to specify the desired output of the network. We have shown (Becker and Hinton 1992b) how the external teacher can be replaced by internally derived teaching signals. These signals are generated by using the assumption that different parts of the perceptual input have common causes in the external world. Small modules that look at separate but related parts of the perceptual input discover these common causes by striving to produce outputs that agree with each other (see Fig. 1a). The modules may look at different modalities (e.g., vision and touch), or the same modality at different times (e.g., the consecutive 2-D views of a rotating 3-D object), or even spatially adjacent parts of the same image. In previous work, we showed that when our learning procedure is applied to adjacent patches of images, it allows a neural network that has no prior knowledge of depth to discover stereo disparity in random dot stereograms of curved surfaces. A more general version of the method allows the network to discover the best way of interpolating the depth at one location from the depths at nearby locations. We first summarize this earlier work, and then introduce two new models that allow coherent predictions to be made in the presence of discontinuities. The first assumes a model of the world in which patterns are drawn from two possible classes: one that can be captured by a simple model of coherence, and one that is unpredictable. This allows the network to reject cases containing discontinuities. The second method allows the network to develop multiple models of coherence, by learning a mixture of depth interpolators for curved surfaces with discontinuities. Rather than rejecting cases containing discontinuities, the network develops a set of location-specific discontinuity detectors, and appropriate interpolators for each class of discontinuities. An alternative way of learning the same representation for this problem, using an unsupervised version of the competing experts algorithm described by Jacobs et al. (1991), is described in Becker and Hinton (1992a).

Neural Computation 5, 267-277 (1993) © 1993 Massachusetts Institute of Technology

2 Learning Spatially Coherent Features in Images
Using a modular architecture as shown in Figure la, a network can learn to model a spatially coherent surface, by extracting mutually predictable features from neighboring image patches. The goal of the learning is to produce good agreement between the outputs of modules that receive input from neighboring patches. The simplest way to get the outputs of two modules to agree is to use the squared difference between the outputs as a cost function, and to adjust the weights in each module so as to minimize this cost. Unfortunately, this usually causes each module to produce the same constant output that is unaffected by the input to the module and therefore conveys no information about it. We would like the outputs of two modules to agree closely (i.e., to have a small expected squared difference) relative to how much they both vary as the input is varied. When this happens, the two modules must be responding to something that is common to their two inputs. In the special case when the outputs, da, db, of the two modules are scalars, a good measure of agreement is
I = 0.5 log [ V(da + db) / V(da - db) ]

where V is the variance over the training cases. Under the assumption that da and db are both versions of the same underlying gaussian signal that have been corrupted by independent gaussian noise, it can be shown that I is the mutual information (Shannon and Weaver 1964) between the underlying signal and the average of da and db. By maximizing I we force the two modules to extract as pure a version as possible of the underlying common signal.
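This agreement measure can be checked numerically: two noisy views of a common gaussian signal give a large I, while unrelated outputs give I near zero (the signal and noise levels below are arbitrary choices for illustration):

```python
import math
import random

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def agreement(d_a, d_b):
    """0.5 * log(V(da + db) / V(da - db)) over the training cases."""
    sums = [a + b for a, b in zip(d_a, d_b)]
    diffs = [a - b for a, b in zip(d_a, d_b)]
    return 0.5 * math.log(variance(sums) / variance(diffs))

random.seed(0)
signal = [random.gauss(0, 1) for _ in range(5000)]   # common underlying cause
d_a = [s + random.gauss(0, 0.3) for s in signal]     # module A's noisy output
d_b = [s + random.gauss(0, 0.3) for s in signal]     # module B's noisy output
noise = [random.gauss(0, 1) for _ in range(5000)]    # an unrelated output

assert agreement(d_a, d_b) > 1.0          # shared signal: high agreement
assert abs(agreement(d_a, noise)) < 0.2   # no shared signal: I near zero
```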
2.1 The Basic Stereo Net. We have shown how this principle can be applied to a multilayer network that learns to extract depth from random dot stereograms (Becker and Hinton 1992b). Each network module received input from a patch of a left image and a corresponding patch of a right image, as shown in Figure 1a. Adjacent modules received input from adjacent stereo image patches, and learned to extract depth by trying to maximize agreement between their outputs. The real-valued depth (relative to the plane of fixation) of each patch of the surface gives rise to a disparity between features in the left and right images; because that disparity is the only property that is coherent across each stereo image, the output units of modules were able to learn to accurately detect relative depth.

Figure 1: (a) Two modules that receive input from corresponding parts of stereo images. The first module receives input from stereo patch A, consisting of a horizontal strip from the left image (striped) and a corresponding strip from the right image (hatched). The second module receives input from an adjacent stereo patch B. The modules try to make their outputs, da and db, convey as much information as possible about some underlying signal (i.e., the depth) which is common to both patches. (b) The architecture of the interpolating network, consisting of multiple copies of modules like those in (a) plus a layer of interpolating units. The network tries to maximize the information that the locally extracted parameter dc and the contextually predicted parameter d̂c convey about some common underlying signal. We actually used 10 modules, and the central 6 modules tried to maximize agreement between their outputs and contextually predicted values. We used weight averaging to constrain the interpolating function to be identical for all modules.

2.2 The Interpolating Net. The basic stereo net uses a very simple model of coherence in which an underlying parameter at one location is assumed to be approximately equal to the parameter at a neighboring location. This model is fine for the depth of frontoparallel surfaces, but it is far from the best model of slanted or curved surfaces. Fortunately, we can use a far more general model of coherence in which the parameter at one location is assumed to be an unknown linear function of the parameters at nearby locations. The particular linear function that is appropriate can be learned by the network. We used a network of the type shown in Figure 1b. The depth computed locally by a module, dc, was compared with the depth d̂c predicted by a linear combination of the outputs of nearby modules, and the network tried to maximize the agreement between dc and d̂c. The contextual prediction, d̂c, was produced by computing a weighted sum of the outputs of two adjacent modules on either side. The interpolating weights used in this sum, and all other weights in the network, were adjusted so as to maximize agreement between locally computed and contextually predicted depths. To speed the learning, we first trained the lower layers of the network as before, so that agreement was maximized between neighboring locally computed outputs. This made it easier to learn good interpolating weights. When the network was trained on stereograms of cubic surfaces, it learned interpolating weights of -0.147, 0.675, 0.656, -0.131 (Becker and Hinton 1992b). Given noise-free estimates of local depth, the optimal linear interpolator for a cubic surface is -0.167, 0.667, 0.667, -0.167.

3 Mixture Models of Coherence
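The optimal weights -0.167, 0.667, 0.667, -0.167 are just the Lagrange coefficients for predicting the value of a cubic at the center from samples at offsets -2, -1, +1, +2, as a short check confirms (the particular cubic coefficients below are arbitrary):

```python
# Lagrange interpolation weights for predicting a cubic at x = 0
# from samples at x = -2, -1, +1, +2:
W = (-1/6, 2/3, 2/3, -1/6)   # = (-0.167, 0.667, 0.667, -0.167)

def cubic(x, a=0.7, b=-1.2, c=0.4, d=2.0):
    """An arbitrary cubic depth profile (coefficients chosen for the demo)."""
    return a * x**3 + b * x**2 + c * x + d

depths = [cubic(x) for x in (-2, -1, 1, 2)]        # noise-free local depths
predicted = sum(w * z for w, z in zip(W, depths))  # contextual prediction

assert abs(predicted - cubic(0)) < 1e-9            # exact for any cubic
```

Since a degree-3 polynomial is determined by four points, these weights reproduce the central depth exactly whenever the local estimates are noise-free.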
The models described above were based on the assumption of a single type of coherence in images. We assumed there was some parameter of the image that was either constant for nearby patches, or varied smoothly across space. In natural scenes, these simple models of coherence may not always hold. There may be widely varying amounts of curvature, from smooth surfaces, to highly curved spherical or cylindrical objects. There may be coherent structure at several spatial scales; for example, a rough surface like a brick wall is highly convoluted at a fine spatial scale, while at a coarser scale it is planar. And at boundaries between objects, or between different parts of the same object, there will be discontinuities in coherence. It would be better to have multiple models of coherence, which could account for a wider range of surfaces. One way to handle multiple models is to have a mixture of distributions (McLachlan and Basford 1988). In this section, we introduce a new way of employing mixture models to account for a greater variety of situations. We extend the learning procedure described in the previous section based on these models.
3.1 Throwing out Discontinuities. If the surface is continuous, the depth at one patch can be accurately predicted from the depths of two patches on either side. If, however, the training data contain cases in which there are depth discontinuities (see Fig. 2), the interpolator will also try to model these cases, and this will contribute considerable noise to the interpolating weights and to the depth estimates. One way of reducing this noise is to treat the discontinuity cases as outliers and to throw them out. Rather than making a hard decision about whether a case is an outlier, we make a soft decision by using a mixture model. For each training case, the network compares the locally extracted depth, dc, with the depth predicted from the nearby context, d̂c. It assumes that dc - d̂c is drawn from a zero-mean gaussian if it is a continuity case and from a uniform distribution if it is a discontinuity case, as shown in Figure 3. It can then estimate the probability of a continuity case:

p_cont(dc - d̂c) = N(dc - d̂c) / [ N(dc - d̂c) + k_discont ]    (3.1)

where N is a gaussian, and k_discont is a constant representing a uniform density.¹ We can now optimize the average information dc and d̂c transmit about their common cause. We assume that no information is transmitted in discontinuity cases, so the average information depends on the probability of continuity and on the variance of dc + d̂c and dc - d̂c measured only in the continuity cases:

I* = P_cont 0.5 log [ V_cont(dc + d̂c) / V_cont(dc - d̂c) ]    (3.2)

where P_cont = ⟨p_cont(dc - d̂c)⟩. We tried several variations of this mixture approach. The network is quite good at rejecting the discontinuity cases, but this leads to only a modest improvement in the performance of the interpolator. In cases where there is a depth discontinuity between da and db or between dd and de, the interpolator works moderately well because the weights on da or de are small. Because of the term P_cont in equation 3.2 there is pressure to include these cases as continuity cases, so they probably contribute noise to the interpolating weights. In the next section we show how to avoid making a forced choice between rejecting these cases or treating them just like all the other continuity cases.
¹We empirically select a good (fixed) value of k_discont, and we choose a starting value of V_cont(dc - d̂c) (some proportion of the initial variance of dc - d̂c), and gradually shrink it during learning. The learning algorithm's performance is fairly robust with respect to variations in the choice of k_discont; the main effect of changing this parameter is to sharpen or flatten the network's probabilistic decision function for labeling cases as continuous or discontinuous (equation 3.1). The choice of V_cont(dc - d̂c), on the other hand, turns out to affect the learning algorithm more critically; if this variance is too small, many cases will be treated as discontinuous, and the network may converge to very large weights which overfit only a small subset of the training cases. There is no problem, however, if this variance is too large initially; in this case, all patterns are treated as continuous, and as the variance is shrunk during learning, some discontinuous cases are eventually detected.
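The soft continuity decision of equation 3.1 can be sketched as follows (the gaussian variance and k_discont values below are illustrative, not the ones used in the paper):

```python
import math

def p_cont(err, var=0.05, k_discont=0.1):
    """Posterior probability that a case is a continuity case: a zero-mean
    gaussian on err = dc - dc_hat competes with a uniform outlier density
    k_discont.  (var and k_discont are illustrative values.)"""
    n = math.exp(-err ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    return n / (n + k_discont)

assert p_cont(0.0) > 0.9   # small prediction error: almost surely continuous
assert p_cont(1.0) < 0.1   # large error: treated as a discontinuity case
```

Because the decision is soft, borderline cases contribute partially to both the gaussian and the outlier model rather than being forced into one class.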
Figure 2: (Top) A curved surface strip with a discontinuity, created by fitting 2 cubic splines through randomly chosen control points, 25 pixels apart, separated by a depth discontinuity. Feature points are randomly scattered on each spline with an average of 0.22 features per pixel. (Bottom) A stereo pair of "intensity" images of the surface strip formed by taking two different projections of the feature points, filtering them through a gaussian, and sampling the filtered projections at evenly spaced sample points. The sample values in corresponding patches of the two images are used as the inputs to a module. The depth of the surface for a particular image region is directly related to the disparity between corresponding features in the left and right patch. Disparity ranges continuously from -1 to +1 image pixels. Each stereo image was 120 pixels wide and divided into 10 receptive fields 10 pixels wide and separated by 2 pixel gaps, as input for the networks shown in Figure 1. The receptive field of an interpolating unit spanned 58 image pixels, and discontinuities were randomly located a minimum of 40 pixels apart, so only rarely would more than one discontinuity lie within an interpolator's receptive field.
3.2 Learning a Mixture of Interpolators. The presence of a depth discontinuity somewhere within a strip of five adjacent patches does not necessarily destroy the predictability of depth across these patches. It may just restrict the range over which a prediction can be made. So instead of throwing out cases that contain a discontinuity, the network could try to develop a number of different, specialized models of spatial coherence across several image patches. If, for example, there is a depth discontinuity between dc and dd in Figure 1b, an extrapolator with weights of -1.0, +2.0, 0, 0 would be an appropriate predictor of dc. The network could also try to detect the locations of discontinuities, and use this information as the basis for deciding which model to apply on a given case. This information is useful not only in making clean decisions about which coherence model to apply; it also provides valuable cues for interpreting the scene by indicating the locations of object boundaries in the image. Thus, we can use both the interpolated depth map, as well as the locations of depth discontinuities, in subsequent stages of scene interpretation.

Figure 3: The probability distribution of dc, P1(dc), is modeled as a mixture of two distributions: a gaussian with mean d̂c and small variance, and P2(dc), a uniform distribution. Sample points for dc and d̂c are shown. In this case they are far apart, so dc is more likely to have been drawn from P2.

A network can learn to discover multiple coherence models using a set of competing interpolators. Each interpolator tries, as before, to achieve high agreement between its output and the depth extracted locally by a module. Additionally, each interpolator tries to account for as many cases as possible by maximizing the probability that its model holds. The objective function maximized by the network is the sum over models, i, of the agreement between the output of the ith model, d̂ci, and the locally extracted depth, dc, weighted by the probability of the ith model:
I* = Σi Pi 0.5 log [ Vi(dc + d̂ci) / Vi(dc - d̂ci) ]    (3.3)

where the V's represent variances given that the ith model holds, and Pi = ⟨pi^a⟩. The probability that the ith model is applicable on each case a, pi^a, can be computed independently of how well the interpolators are doing;² this can be done by adding extra "controller" units to the network, as shown in Figure 4, whose sole purpose is to compute the probability, pi, that each interpolator's model holds. The weights of both the controllers and the interpolating experts can be learned simultaneously, so as to maximize I*. By assigning a controller to each expert interpolator, each controller should learn to detect a discontinuity at a particular location (or the absence of a discontinuity, in the case of the interpolator for pure continuity cases). And each interpolating unit should learn to capture the particular type of coherence that remains in the presence of a discontinuity at a particular location. The outputs of the controllers are normalized, so that they represent a probability distribution over the interpolating experts' models. We can think of these normalized outputs as the probability with which the system selects a particular expert. Each controller's output is a normalized exponential function of its squared total input, xi:

pi = exp(xi² / v(xi)) / Σj exp(xj² / v(xj))    (3.4)

²More precisely, this computed probability is conditionally independent of the interpolators' performance on a particular case, with independence being conditioned on a fixed set of weights. As the reviewer has pointed out, when the weights change over the course of learning, there is an interdependence between the probabilities and interpolated quantities via the shared objective function.

Figure 4: An architecture for learning a mixture model of curved surfaces with discontinuities, consisting of a set of interpolators and discontinuity detectors. We actually used a larger modular network with equality constraints between the weights of corresponding units in different modules, with 6 copies of the architecture shown here. Each copy received input from different but overlapping parts of the input.
Squaring the total input makes it possible for each unit to detect a depth edge at a particular location, independently of the direction of contrast change. We normalize the squared total input in the exponential by an estimate of its variance, v(xi) = k Σj wij². (This estimate of the variance of the total weighted input is exact if the unweighted individual inputs are independent, gaussian, and have equal variances of size k.) This discourages any one unit from trying to model all of the cases simply by having huge weights. The controllers get to see all five local depth estimates, da ... de. As before, each interpolating expert computes a linear function of four contextually extracted depths, d̂ci = wia da + wib db + wid dd + wie de, in order to try to predict the centrally extracted depth dc.
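The controller normalization of equation 3.4 can be sketched as follows (the inputs, weight norms, and constant k below are illustrative values):

```python
import math

def controller_probs(xs, weight_sq_norms, k=1.0):
    """Normalized exponential of each controller's squared total input x_i,
    scaled by the variance estimate k * sum_j w_ij^2 so that a unit cannot
    dominate simply by having huge weights."""
    scores = [math.exp(x * x / (k * wn)) for x, wn in zip(xs, weight_sq_norms)]
    z = sum(scores)
    return [s / z for s in scores]

# A controller straddling a depth edge sees a large total input |x_i|,
# in either contrast direction, and wins the competition:
p = controller_probs([0.1, -2.0, 0.3], [1.0, 1.0, 1.0])
assert max(p) == p[1] and abs(sum(p) - 1.0) < 1e-12
```

Squaring before the exponential makes the competition blind to the sign of the edge, exactly as the text requires.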
We first trained the network using the original continuous model, as described in Section 2, on a training set of 1000 images with discontinuities, until the lower layers of the network became well tuned to depth. So the interpolators were initially pretrained using the continuity model, and all the interpolators learned similar weights. We then froze the weights in the lower layers, added a small amount of noise to the interpolators' weights (uniform in [-0.1, 0.1]), and applied the mixture model to improve the interpolators and train the controller units. We ran the learning procedure for 10 runs, each run starting from different random initial weights and proceeding for 10 conjugate gradient learning iterations. The network learned similar solutions in each case. A typical set of weights from one run is shown in Figure 5. The graph on the right in this figure shows that four of the controller units are tuned to discontinuities at different locations. The weights for the first interpolator (shown in the top left) are nearly symmetrical, and the corresponding controller's weights (shown immediately to the right) are very small; the graph on the right shows that this controller (shown as a solid line plot) mainly responds in cases when there is no discontinuity. The second interpolator (shown in the left column, second from the top) predominantly uses the leftmost three depths; the corresponding controller for this interpolator (immediately right of the top left interpolator's weights) detects discontinuities between the rightmost two depths, dd and de. Similarly, the remaining controllers detect discontinuities to the right or left of dc; each controller's corresponding interpolator uses the depths on the opposite side of the discontinuity to predict dc.

4 Discussion
We have described two ways of modeling spatially coherent features in images of scenes with discontinuities. The first approach was to simply try to discriminate between patterns with and without discontinuities, and throw away the former. In theory, this approach is promising, as it provides a way of making the algorithm more robust against outlying data points. We then applied the idea of multiple models of coherence to a set of interpolating units, again using images of curved surfaces with discontinuities. The competing controllers in Figure 4 learned to explicitly represent which regularity applies in a particular region. The output of the controllers was used to compute a probability distribution over the various competing models of coherence. The representation learned by this network has a number of advantages. We now have a measure of the probability that there is a discontinuity that is independent of the prediction error of the interpolator. So we can tell how much to trust each interpolator’s estimate on each case. It should be possible to distinguish clear cases of discontinuities from cases that are simply noisy, by the entropy of the controllers’ outputs.
Figure 5: (a) Typical weights learned by the five competing interpolators and corresponding five discontinuity detectors. Positive weights are shown in white, and negative weights in black. (b) The mean probabilities computed by each discontinuity detector are plotted against the distance from the center of the units' receptive field to the nearest discontinuity. The probabilistic outputs are averaged over an ensemble of 1000 test cases. If the nearest discontinuity is beyond ±30 pixels, it is outside the units' receptive field and the case is therefore a continuity example.
Furthermore, the controller outputs tell us not only that a discontinuity is present, but exactly where it lies. This information is important for segmenting scenes, and should be a useful representation for later stages of unsupervised learning. Like the raw depth estimates, the location of depth edges should exhibit coherence across space, at larger spatial scales. It should therefore be possible to apply the same algorithm recursively to the outputs of the controllers, to find object boundaries in two-dimensional stereo images. The approach presented here should be applicable to other domains that contain a mixture of alternative local regularities across space or time. For example, a rigid shape causes a linear constraint between the locations of its parts in an image, so if there are many possible shapes, there are many alternative local regularities (Zemel and Hinton 1991).
Acknowledgments
This research was funded by grants from NSERC and the Ontario Information Technology Research Centre. Hinton is the Noranda fellow of the Canadian Institute for Advanced Research. Thanks to John Bridle and Steve Nowlan for helpful discussions.
References

Becker, S., and Hinton, G. E. 1992a. Learning to make coherent predictions in domains with discontinuities. In Advances in Neural Information Processing Systems 4. Morgan Kaufmann, San Mateo, CA.

Becker, S., and Hinton, G. E. 1992b. A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature (London) 355, 161-163.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. 1991. Adaptive mixtures of local experts. Neural Comp. 3(1), 79-87.

McLachlan, G. J., and Basford, K. E. 1988. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.

Shannon, C. E., and Weaver, W. 1964. The Mathematical Theory of Communication. The University of Illinois Press, Urbana, IL.

Zemel, R. S., and Hinton, G. E. 1991. Discovering viewpoint-invariant relationships that characterize objects. In Advances in Neural Information Processing Systems 3, pp. 299-305. Morgan Kaufmann, San Mateo, CA.
Received 5 June 1992; accepted 26 June 1992.
Communicated by Steve Suddarth
Hints and the VC Dimension Yaser S. Abu-Mostafa California Institute of Technology, Pasadena, CA 91125 USA
Learning from hints is a generalization of learning from examples that allows a variety of information about the unknown function to be used in the learning process. In this paper, we use the VC dimension, an established tool for analyzing learning from examples, to analyze learning from hints. In particular, we show how the VC dimension is affected by the introduction of a hint. We also derive a new quantity that defines a VC dimension for the hint itself. This quantity is used to estimate the number of examples needed to "absorb" the hint. We carry out the analysis for two types of hints, invariances and catalysts. We also describe how the same method can be applied to other types of hints.

1 Introduction

Learning from examples deals with an unknown function f that is represented by examples to the learning process. The process uses the examples to infer an approximate implementation of f. Learning from hints (Abu-Mostafa 1990) generalizes the situation by allowing other information that we may know about f to be used in the learning process. Such information may include invariance properties, symmetries, correlated functions (Suddarth and Holden 1991), explicit rules (Win and Giles 1992), minimum-distance properties (Al-Mashouq and Reed 1991), or any other fact about f that narrows down the search. In many practical situations, we do have some prior information about f, and the proper use of this information (instead of just using "blind" examples of f) can make the difference between feasible and prohibitive learning. In this paper, we develop a theoretical analysis of learning from hints. The analysis is based on the VC dimension (Blumer et al. 1989), which is an established tool for analyzing learning from examples. Simply stated, the VC dimension VC(G) furnishes an upper bound for the number of examples needed by a learning process that starts with a set of hypotheses G about what f may be.
Neural Computation 5, 278-288 (1993) © 1993 Massachusetts Institute of Technology

The examples guide the search for a hypothesis g ∈ G that is a good replica of f. Since f is unknown to begin with, we start with a relatively big set of hypotheses G to maximize our chances of finding a good approximation of f among them. However, the larger G is, the more examples of f we
need to pinpoint the good hypothesis. This is reflected in a bigger value of VC(G). How do we make G smaller without the risk of losing good approximations of f? This is where the hints come in. Since a hint is a known property of f, we can use it as a litmus test to weed out bad g's, thus shrinking G without losing good hypotheses. The main result of this paper is the application of the VC dimension to hints in two forms.

1. The VC dimension provides an estimate for the number of examples needed to learn f. When a hint H is given about f, the number of examples of f can be reduced. This is reflected in a smaller "VC dimension given the hint" VC(G | H).

2. If H itself is represented to the learning process by a set of examples,
we would like to estimate how many examples are needed to absorb the hint. This calls for a generalization of the VC dimension to cover examples of the hint as well as examples of the function, which is reflected in a "VC dimension for the hint" VC(G; H).

We will study two types of hints in particular, invariances and catalysts. We will discuss how the same framework can be used to study other types of hints. A detailed account of the VC dimension can be found in Blumer et al. (1989) and Vapnik and Chervonenkis (1971). We will provide a brief background here to make the paper self-contained.

The setup for learning from examples consists of an environment X and an unknown function f : X → {0, 1} that we wish to learn. The goal is to produce a hypothesis g : X → {0, 1} that approximates f. To do this, the learning process starts with a set of hypotheses G and tries to select a good g ∈ G based on a number of examples [x_1, f(x_1)]; ...; [x_N, f(x_N)] of f. To generate the examples, we assume that there is a probability distribution P(x) on the environment X. Each example is picked independently according to P(x). The hypothesis g that results from the learning process is considered a good approximation of f if the probability [w.r.t. P(x)] that g(x) ≠ f(x) is small. The learning process should have a high probability of producing a good approximation of f when a sufficient number of examples is provided. The VC dimension helps determine what is "sufficient." Here is how it works.

Let π_g = Pr[g(x) = f(x)], where Pr[·] denotes the probability of an event. We wish to pick a hypothesis g that has π_g ≈ 1. However, f is unknown and thus we do not know the values of these probabilities. Since f is represented by examples, we can compute the frequency of agreement between each g and f on the examples and base our choice of g on the frequencies instead of the actual probabilities. Let hypothesis g agree with f on a fraction ν_g of the examples.
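The interplay between the frequencies ν_g and the probabilities π_g can be illustrated with a toy numerical sketch; the uniform environment, target, and threshold hypothesis set below are assumptions for illustration, not taken from the paper.

```python
import random

random.seed(0)

# Toy environment X = {0,...,9} with uniform P(x).
X = list(range(10))

# Unknown target f and a small hypothesis set G of threshold functions.
f = lambda x: 1 if x >= 5 else 0
G = [lambda x, t=t: 1 if x >= t else 0 for t in range(11)]

# True agreement probabilities pi_g = Pr[g(x) = f(x)] under uniform P(x).
pi = [sum(g(x) == f(x) for x in X) / len(X) for g in G]

# Empirical agreement frequencies nu_g on N i.i.d. examples of f.
N = 2000
sample = [random.choice(X) for _ in range(N)]
nu = [sum(g(x) == f(x) for x in sample) / N for g in G]

# The VC inequality controls the worst deviation sup_g |nu_g - pi_g|.
worst = max(abs(n - p) for n, p in zip(nu, pi))
print(round(worst, 3))
```

With N = 2000 the worst deviation over all eleven hypotheses is already small, so choosing g by its frequency ν_g is nearly as good as choosing it by the unknown π_g.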
We pick a hypothesis that has ν_g ≈ 1. The VC inequality asserts that the values of the ν_g's will be close to the π_g's. Specifically, for any ε > 0,

Pr[ sup_{g∈G} |ν_g − π_g| > ε ] ≤ 4 m(2N) e^{−ε²N/8}
Yaser S. Abu-Mostafa
280
where "sup" denotes the supremum, and m is the growth function of G. m(N) is the maximum number of different binary vectors g(x_1) ··· g(x_N) that can be generated by varying g over G while keeping x_1, ..., x_N ∈ X fixed. Clearly, m(N) ≤ 2^N for all N. The VC dimension VC(G) is defined as the smallest N for which m(N) < 2^N. We assume that G has a finite VC dimension. If VC(G) = d, the growth function m(N) can be bounded by

m(N) ≤ Σ_{i=0}^{d} (N choose i) ≤ N^d + 1
When this estimate is substituted in the VC inequality, the right-hand side of the inequality becomes arbitrarily small for sufficiently large N. This means that it is almost certain that each ν_g is approximately the same as the corresponding π_g. This is the rationale for considering N examples sufficient to learn f. We can afford to base our choice of hypothesis on ν_g as calculated from the examples, because it is approximately the same as π_g. How large N needs to be to achieve a certain degree of approximation is affected by the value of the VC dimension.

In this paper, we assume that f ∈ G. This means that G is powerful enough to implement f. We also assume that f strictly satisfies the hint H. This means that f will not be excluded as a result of taking H into consideration. Finally, we assume that everything that needs to be measurable will be measurable.

2 Invariance Hints
It is often the case that we know an invariance property of an otherwise unknown function. For example, speaker identification based on a speech waveform is invariant under time shift of the waveform. Properties such as shift invariance and scale invariance are commonplace in pattern recognition, and dozens of methods have been developed to take advantage of them (e.g., Hu 1962). Invariances have also been used in neural networks, for example, group invariance of functions (Minsky and Papert 1988) and the use of invariances in backpropagation (Abu-Mostafa 1990). An invariance hint H can be formalized by the partition
X = ∪_α X_α
of the environment X into the invariance classes X_α, where α is an index. Within each class X_α, the value of f is constant. In other words, x, x′ ∈ X_α implies that f(x) = f(x′). Some invariance hints are "strong" and others are "weak," and this is reflected in the partition X = ∪_α X_α. The finer the partition, the weaker the hint. For instance, if each X_α contains a single point, the hint is extremely weak (actually useless) since the information that x, x′ ∈ X_α
implies that f(x) = f(x′) tells us nothing new, as x and x′ are the same point in this case. On the other extreme, if there is a single X_α that contains all the points (X_α = X), the hint is extremely strong, as it forces f to be constant over X (either f ≡ 1 or f ≡ 0). Practical hints, such as scale invariance and shift invariance, lie between these two extremes.

In what follows, we will apply the VC dimension to an invariance hint H. We will start by assessing the impact of H on the original VC dimension. We will then focus on representing H by examples and address what an example of H is, how to define a VC dimension for H, and what it means to approximate H. Finally, we will discuss relations between different VC dimensions.
2.1 How the Hint Affects VC(G). The VC dimension is used to estimate the number of examples needed to learn an unknown function f. It is intuitive that, with the benefit of a hint about f, we should need fewer examples. To formalize this intuition, let the invariance hint H be given by the partition X = ∪_α X_α. Each hypothesis g ∈ G either satisfies H or else does not satisfy it. Satisfying H means that whenever x, x′ ∈ X_α, then g(x) = g(x′). The set of hypotheses that satisfy H is

G_H = {g ∈ G | x, x′ ∈ X_α ⇒ g(x) = g(x′)}

G_H is a set of hypotheses and, as such, has a VC dimension of its own. This is the basis for defining the VC dimension of G given H:

VC(G | H) = VC(G_H)

Since G_H ⊆ G, it follows that VC(G | H) ≤ VC(G). Nontrivial hints lead to a significant reduction from G to G_H, resulting in VC(G | H) < VC(G). On the other hand, some hints may have VC(G | H) = VC(G). For instance, in the case of the weak hint we talked about, every g trivially satisfies the hint, hence G_H = G.

VC(G | H) replaces VC(G) following the "absorption" of the hint. Without the hint, VC(G) provides an estimate for the number of examples needed to learn f. With the hint, VC(G | H) provides a new estimate for the number of examples. This estimate is valid regardless of the mechanism for absorbing the hint, as long as it is completely absorbed. If, however, the hint is only partially absorbed (which means that some g's that do not strictly satisfy the invariance are still allowed), the effective VC dimension lies between VC(G) and VC(G | H).

2.2 Representing the Hint by Examples. What is an example of an invariance hint? If we take the hint specified by X = ∪_α X_α, an example would be "f(x) = f(x′)," where x and x′ belong to the same invariance class. In other words, an example is a pair (x, x′) that belongs to the same X_α.
The motivation for representing a hint by examples is twofold. First, the hint needs to be incorporated in what is already a learning-from-examples process. The example f(x) = f(x′) can be directly included in descent methods such as backpropagation along with examples of the function itself. To do this, the quantity [g(x) − g(x′)]² is minimized the same way [g(x) − f(x)]² is minimized when we use an example of f. In addition, we may represent a hint by examples if it cannot be easily expressed as a global mathematical constraint. For instance, invariance under elastic deformation of images does not readily yield an obvious constraint on the weights of a feedforward network. In contrast to the function f, which is represented by a controlled number of examples and is otherwise unknown, a hint can be represented by as many examples as we wish, since it is a known property and hence can be used indefinitely to generate examples.

Examples of the hint, like examples of the function, are generated according to a probability distribution. One way to generate (x, x′) is to pick x from X according to the probability distribution P(x), then pick x′ from X_α (the invariance class that contains x) according to the conditional probability distribution P(x′ | X_α). A sequence of N (pairs of) examples (x_1, x_1′); (x_2, x_2′); ...; (x_N, x_N′) would be generated in the same way, independently from pair to pair.

2.3 A VC Dimension for the Hint. As we discussed in the introduction, the VC inequality is used to estimate how well f is learned. We wish to use the same inequality to estimate how well H is absorbed. To do this, we transform the situation from hints to functions. This calls for definitions of a new environment X̂, distribution P̂, hypothesis set Ĝ, and target function f̂. Let H be the invariance hint X = ∪_α X_α. The new environment is defined by
X̂ = ∪_α (X_α × X_α)
(pairs of points coming from the same invariance class) with the probability distribution described above,

P̂(x, x′) = P(x) P(x′ | X_α)

where X_α is the class that contains x (hence contains x′). The new set of hypotheses Ĝ, defined on the environment X̂, contains a hypothesis ĝ for every hypothesis g ∈ G such that

ĝ(x, x′) = 1 if g(x) = g(x′), and ĝ(x, x′) = 0 if g(x) ≠ g(x′)

and the function to be "learned" is f̂(x, x′) ≡ 1.
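The transformation from hints to functions can be sketched directly; the six-point environment, its three invariance classes, and the two hypotheses below are assumptions made for illustration.

```python
import random

random.seed(1)

# Invariance classes of a toy environment X = {0,...,5}: x -> class index.
classes = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}

def sample_hint_example():
    """Pick x ~ P(x) (uniform here), then x' uniformly from x's class."""
    x = random.randrange(6)
    mates = [y for y in classes if classes[y] == classes[x]]
    return x, random.choice(mates)

def g_hat(g, x, xp):
    """ghat(x, x') = 1 iff g agrees on the pair; the target fhat is always 1."""
    return 1 if g(x) == g(xp) else 0

# A hypothesis that satisfies the hint agrees with every hint example:
g_ok = lambda x: classes[x] % 2          # constant on each class
assert all(g_hat(g_ok, *sample_hint_example()) == 1 for _ in range(100))

# One that violates the invariance eventually disagrees with some example:
g_bad = lambda x: x % 2
disagreements = sum(1 - g_hat(g_bad, *sample_hint_example()) for _ in range(1000))
print(disagreements > 0)
```

Learning the constant function f̂ ≡ 1 over such pairs is exactly what "absorbing" the invariance hint means.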
The VC dimension of the set of hypotheses Ĝ is the basis for defining a VC dimension for the hint:

VC(G; H) = VC(Ĝ)

VC(G; H) depends on both G and H since Ĝ is based on G and the new environment X̂ (which in turn depends on H).

2.4 Approximation of the Hint. If the above learning process resulted in the hypothesis ĝ = f̂ (the constant 1), the corresponding g ∈ G would obviously satisfy the hint. Learning from examples, however, results only in a ĝ that approximates f̂ well (with high probability). The approximation is in terms of the distribution P̂(x, x′) used to generate the examples. Thus, w.r.t. P̂, Pr[ĝ(x, x′) ≠ 1] → 0 as the number of examples N becomes large. Can we translate this statement into a similar one based only on the original distribution P(x)? To do this, we need to rid the statement of x′. Let
Pr[g(x) ≠ g(x′)] = γ

By definition of ĝ, Pr[ĝ(x, x′) ≠ 1] is the same as Pr[g(x) ≠ g(x′)]. This implies that γ → 0 as N → ∞. In words, if we pick x and x′ at random according to P̂(x, x′), the probability that our hypothesis will have different values on these two points is small. To get rid of x′ from this statement, we introduce hint-satisfying versions of the g's. For each g ∈ G, let g̃ be the best approximation of g that strictly satisfies the hint. This means that, within each invariance class X_α, g̃(x) is constant and its value is the more probable of the two values of g(x) within X_α (ties are broken either way). We will argue that

Pr[g(x) ≠ g̃(x)] ≤ γ

Since γ → 0, this statement [which is solely based on P(x)] implies that "g approximately satisfies the hint" in a more natural way. Here is the argument. Let η be the probability that g(x) ≠ g̃(x). Given X_α, let η_α be the conditional probability that g(x) ≠ g̃(x), and let γ_α be the conditional probability that g(x) ≠ g(x′). From the definition of g̃, η_α must be ≤ 1/2 (otherwise, the value of g̃ in X_α should be flipped). Within each X_α, since g̃ is constant, g(x) ≠ g(x′) if, and only if, g agrees with g̃ on either x or x′ and disagrees on the other. This means that

γ_α = 2 η_α (1 − η_α) ≥ η_α

(since 1 − η_α ≥ 1/2). This is true for every class X_α. Averaging over α, we get γ ≥ η, hence

Pr[g(x) ≠ g̃(x)] = η ≤ γ → 0
This establishes the more natural notion of approximating the hint.
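The inequality at the heart of this argument, γ_α = 2η_α(1 − η_α) ≥ η_α for η_α ≤ 1/2, is easy to check numerically:

```python
# gamma_alpha = 2*eta*(1 - eta) dominates eta on the whole range eta in [0, 1/2].
etas = [i / 1000 for i in range(501)]
assert all(2 * e * (1 - e) >= e for e in etas)

# Equality holds only at the endpoints eta = 0 and eta = 1/2.
print(2 * 0.5 * (1 - 0.5))
```

Averaging the per-class inequality over α is what gives γ ≥ η in the text.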
2.5 A Bound on VC(G; H). As in the case of the set G and its growth function m(N), the VC dimension VC(G; H) = VC(Ĝ) is defined based on the growth function m̂(N) of the set Ĝ. m̂(N) is the maximum number of patterns of 1's and 0's that can be obtained by applying the ĝ's to (fixed but arbitrary) N examples (x_1, x_1′); (x_2, x_2′); ...; (x_N, x_N′). VC(G; H) is the smallest N for which m̂(N) < 2^N. The value of VC(G; H) will differ from hint to hint. Consider our two extreme examples of weak and strong hints. The weak hint has VC(G; H) as small as 1, since each ĝ always agrees with each example of the hint [hence every ĝ is the constant 1, and m̂(N) = 1 for all N]. The strong hint has VC(G; H) as large as it can be. How large is that? In Fyfe (1992), it is shown that for any invariance hint H,
VC(G; H) < λ VC(G)

where λ ≈ 4.54. The argument goes as follows. For each pattern generated by the g's on x_1, x_1′, x_2, x_2′, ..., x_N, x_N′, there is at most one distinct pattern generated by the ĝ's on

(x_1, x_1′); (x_2, x_2′); ...; (x_N, x_N′)

because ĝ(x_n, x_n′) is uniquely determined by g(x_n) and g(x_n′). Therefore,

m̂(N) ≤ m(2N)

If VC(G) = d, we can use Chernoff bounds (Feller 1968) to estimate m(2N) for N ≥ d as follows:

m(2N) ≤ Σ_{i=0}^{d} (2N choose i) ≤ 2^{2N·ℋ(d/2N)}

where ℋ(θ) = −θ log₂ θ − (1 − θ) log₂(1 − θ) is the binary entropy function. Therefore, once ℋ(d/2N) ≤ 1/2, m̂(N) will be less than 2^N and N must have reached, or exceeded, the VC dimension of Ĝ. This happens at N/d ≈ 4.54.

In many cases, the relationship between VC(G | H) and VC(G; H) can be roughly stated as follows: the smaller one is, the bigger the other is. Strong hints generally result in a small value of VC(G | H) and a large value of VC(G; H), while weak hints result in the opposite situation [the loose similarity with the average mutual information I(X; Y) and the conditional entropy H(X | Y) in information theory is the reason for choosing this notation for the various VC dimensions]. This relationship between VC(G | H) and VC(G; H) may suggest that we do not save when we use examples of a hint and, as a result, use fewer examples of the function. However, it should be noted that examples of the hint can be generated at will, while examples of the function may be limited in number or expensive to generate.
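The constant λ ≈ 4.54 comes from solving ℋ(θ) = 1/2 for θ = d/2N, which can be reproduced numerically (a bisection sketch; the variable names are mine):

```python
from math import log2

def H(t):
    """Binary entropy H(t) = -t*log2(t) - (1-t)*log2(1-t)."""
    if t in (0.0, 1.0):
        return 0.0
    return -t * log2(t) - (1 - t) * log2(1 - t)

# Bisection for the theta in (0, 1/2) with H(theta) = 1/2;
# H is increasing on this interval.
lo, hi = 1e-9, 0.5
for _ in range(200):
    mid = (lo + hi) / 2
    if H(mid) < 0.5:
        lo = mid
    else:
        hi = mid
theta = (lo + hi) / 2

# m_hat(N) < 2^N once 2N*H(d/2N) <= N, i.e. H(d/2N) <= 1/2; so N/d = 1/(2*theta).
lam = 1 / (2 * theta)
print(round(lam, 2))
```

The root is θ ≈ 0.110, giving N/d ≈ 4.54 as in the text.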
3 Catalyst Hints
Catalyst hints (Suddarth and Holden 1991) were introduced as a means of improving the learning behavior of feedforward networks. The idea is illustrated in Figure 1. A network attempting to learn the function g = f is augmented by a catalyst neuron out of the last hidden layer. This neuron is trained to learn a related function g′ = f′. In doing so, the hidden layers of the network are influenced in a way that helps the main learning task g = f. After the learning phase is completed, the catalyst neuron is removed. The catalyst function f′ is typically a "well-behaved version" of f that can be learned more easily and more quickly. When f′ is learned, the internal representations in the hidden layers of the network will be suited for the implementation of the main function f. As a hint, namely a piece of information about f, the catalyst is the assertion that there is a way to set the weights of the network that simultaneously implements g = f and g′ = f′. Unlike invariances, catalysts are very particular to the network we use.

To formalize the catalyst hint, let Θ be the set of pairs of hypotheses (g, g′) that can be simultaneously implemented by the network (when the catalyst neuron is present). The values of the weights in the different
Figure 1: A network that uses a catalyst hint.
layers of the network determine (g, g′). A particular g may appear in different pairs (g, g′) and, similarly, a particular g′ may appear in different pairs (g, g′). Since the catalyst hint puts a condition on g′, its impact on g is indirect, through these pairings of (g, g′). This suggests the following notation: (ḡ, g′) denotes the hypothesis g when the catalyst hypothesis is g′, and (g, ḡ′) denotes the hypothesis g′ when the main hypothesis is g. Applied to a point x ∈ X, we use the convention

(ḡ, g′)(x) = g(x)        (g, ḡ′)(x) = g′(x)

Thus (ḡ, g′) and (g, ḡ′) provide an inflated notation for the hypotheses g and g′, respectively. In these terms, the set of hypotheses G is defined by

G = {(ḡ, g′) | (g, g′) ∈ Θ}
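The catalyst idea of Figure 1 can be sketched with a tiny shared-hidden-layer network. Everything here (the XOR main task, the OR catalyst, the architecture, and the training details) is invented for illustration; the paper prescribes none of it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration: main task f = XOR of two bits; catalyst f' = OR,
# an easier, related function that shares structure with f.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([[0, 0], [1, 1], [1, 1], [0, 1]], dtype=float)  # columns: f, f'

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One shared hidden layer; two output units: the main unit g and the catalyst g'.
W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 2)); b2 = np.zeros(2)

def forward():
    h = sigmoid(X @ W1 + b1)
    return h, sigmoid(h @ W2 + b2)

_, out = forward()
initial_err = ((out - targets) ** 2).mean()

lr = 0.5
for _ in range(20000):
    h, out = forward()
    d_out = (out - targets) * out * (1 - out)   # backprop through squared error
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

_, out = forward()
final_err = ((out - targets) ** 2).mean()

# After learning, the catalyst unit (column 1) is removed; column 0 implements g.
main = out[:, 0]
print(final_err < initial_err, main.shape)
```

Training both output units jointly shapes the shared hidden layer; discarding the catalyst column afterward leaves the main hypothesis g, which is exactly the removal step described in the text.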
To apply the VC dimension to catalyst hints, we will follow the same steps we used for invariance hints. The catalyst hint H is given by the constraint g′ = f′. When H is absorbed, G is reduced to

G_H = {(ḡ, f′) | (g, f′) ∈ Θ}

Obviously, G_H ⊆ G. The VC dimension of G given H is

VC(G | H) = VC(G_H)
Again, VC(G | H) ≤ VC(G). How small VC(G | H) will be depends on the catalyst function f′. For instance, the degenerate case of a constant f′ results in VC(G | H) = VC(G), since the constant can be implemented by the catalyst neuron alone and would not impose any constraint on the weights of the original network. On the other hand, a complex f′ will take specific combinations of weights to implement, thus significantly restricting the network and resulting in VC(G | H) < VC(G). If the hint is only partially absorbed, the effective VC dimension lies between VC(G) and VC(G | H). One situation that leads to partial absorption is when the hint is represented by examples.

An example of the hint H: g′ = f′ takes the form g′(x) = f′(x). In this case, examples of H are of the same nature as examples of f; x is picked from X according to P(x) and f′(x) is evaluated. The definition of examples of H leads to the definition of Ĝ, the set of agreement/disagreement patterns between the hypotheses and the hint. For each hypothesis (g, g′) ∈ G, there is a hypothesis ĝ ∈ Ĝ such that

ĝ(x) = 1 if g′(x) = f′(x), and ĝ(x) = 0 if g′(x) ≠ f′(x)
The VC dimension of Ĝ is the basis for defining VC(G; H), the VC dimension that will indicate how many examples [x, f′(x)] are needed to absorb H. It is given by

VC(G; H) = VC(Ĝ)

Unlike an invariance hint, the particular choice of a catalyst hint (the function f′) does not affect the value of VC(G; H). The VC inequality asserts that a sufficient number of examples will lead to a hypothesis (g, g′) that satisfies

Pr[(g, ḡ′)(x) ≠ f′(x)] → 0

where the probability is taken w.r.t. P(x). Therefore, we will get a hypothesis g that pairs up with a good approximation of f′. This establishes a natural notion of approximating the hint.

4 Conclusion
We have analyzed two different types of hints, invariances and catalysts. The highlight of the analysis is the definition of VC(G | H) and VC(G; H). These two quantities extend the VC inequality to cover learning f given the hint, and learning the hint itself. Other types of hints can be quite different from invariances and catalysts, and will require new analysis. However, the common method for dealing with any type of hint in this framework is as follows.

1. The definition of the hint should determine, for each hypothesis in G, whether or not it satisfies the hint. The set G_H contains those hypotheses that do satisfy the hint. VC(G | H) is defined as VC(G_H).

2. A scheme for representing the hint by examples should be selected.
Each example is generated according to a probability distribution P̂ that depends on the original distribution P. Different examples are generated independently according to the same distribution.

3. For every hypothesis and every example of the hint, we should be able to determine whether or not the hypothesis agrees with the example. The agreement/disagreement patterns define the set of hypotheses Ĝ, and VC(Ĝ) defines VC(G; H). A hypothesis will agree with every possible example if, and only if, it satisfies the hint.

4.
How well a hypothesis approximates the hint is measured by the probability (w.r.t. P̂) that it will agree with a new example. An approximation in this sense should imply a partial absorption of the hint.
5. How the hint is represented by examples may not be unique. The choice of representation affects the definition of VC(G;H) and also affects what partial absorption means. A minimum consistency requirement is that no hypothesis that strictly satisfies the hint should be excluded as a result of the partial absorption process. A good process will exclude as many hypotheses as possible without violating this requirement.
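The steps above can be collected into a generic skeleton. The toy finite environment, the particular hint, and all names below are illustrative assumptions, not the paper's constructions.

```python
import random
from itertools import product

random.seed(0)

# A generic skeleton of the framework: a "hint" supplies the three
# ingredients named in steps 1-3: a strict test, an example generator,
# and an agreement test.
X = list(range(4))
G = [dict(zip(X, bits)) for bits in product((0, 1), repeat=len(X))]

hint = {
    # Step 1: does g strictly satisfy the hint? (Here: invariance on {0, 1}.)
    "satisfies": lambda g: g[0] == g[1],
    # Step 2: generate an example of the hint according to some distribution.
    "example": lambda: (0, 1) if random.random() < 0.5 else (1, 0),
    # Step 3: does g agree with a given example of the hint?
    "agrees": lambda g, ex: g[ex[0]] == g[ex[1]],
}

# G_H: hypotheses passing the strict test (used for VC(G | H)).
G_H = [g for g in G if hint["satisfies"](g)]

# Consistency check from step 5: a hypothesis that strictly satisfies the
# hint agrees with every example, so partial absorption never excludes it.
for g in G_H:
    assert all(hint["agrees"](g, hint["example"]()) for _ in range(50))

print(len(G), len(G_H))
```

Any new hint type plugs into the same three slots; only the definitions of the strict test, the example distribution, and the agreement test change.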
Our analysis here dealt with the situation where the unknown function f strictly satisfies the hint, and strictly belongs to G. Relaxing these conditions is worth further investigation. It is also worthwhile to extend this work to cover real-valued functions, as well as average-case measures instead of the worst-case VC dimension. Finally, schedules for mixing examples of f with examples of the hint in learning protocols are worth exploring.
Acknowledgment

This work was supported by AFOSR Grant 92-J-0398 and the Feynman-Hughes fellowship. The author wishes to thank Dr. Demetri Psaltis for a number of useful comments.
References

Abu-Mostafa, Y. 1990. Learning from hints in neural networks. J. Complex. 6, 192-198.
Al-Mashouq, K., and Reed, I. 1991. Including hints in training neural networks. Neural Comp. 3, 418-427.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36, 929-965.
Feller, W. 1968. An Introduction to Probability Theory and Its Applications, Vol. 2. Wiley, New York.
Fyfe, A. 1992. Invariance hints and the VC dimension. Ph.D. Thesis, Caltech.
Hu, M. 1962. Visual pattern recognition by moment invariants. IRE Trans. Inform. Theory IT-8, 179-187.
Minsky, M., and Papert, S. 1988. Perceptrons, expanded edition. MIT Press, Cambridge, MA.
Omlin, C., and Giles, C. 1992. Training second-order recurrent neural networks using hints. In Machine Learning: Proceedings of the Ninth International Conference (ML-92), D. Sleeman and P. Edwards, eds. Morgan Kaufmann, San Mateo, CA.
Suddarth, S., and Holden, A. 1991. Symbolic neural systems and the use of hints for developing complex systems. Intl. J. Man-Machine Stud. 35, 291.
Vapnik, V., and Chervonenkis, A. 1971. On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16, 264-280.

Received 21 April 1992; accepted 15 July 1992.
Communicated by Ralph Linsker
Redundancy Reduction as a Strategy for Unsupervised Learning

A. Norman Redlich
The Rockefeller University, 1230 York Ave., New York, NY 10021 USA

A redundancy reduction strategy, which can be applied in stages, is proposed as a way to learn as efficiently as possible the statistical properties of an ensemble of sensory messages. The method works best for inputs consisting of strongly correlated groups, that is, features, with weaker statistical dependence between different features. This is the case for localized objects in an image or for words in a text. A local feature measure determining how much a single feature reduces the total redundancy is derived, which turns out to depend only on the probability of the feature and of its components, but not on the statistical properties of any other features. The locality of this measure makes it ideal as the basis for a "neural" implementation of redundancy reduction, and an example of a very simple non-Hebbian algorithm is given. The effect of noise on learning redundancy is also discussed.

1 Introduction

Given sensory messages, for example, the visual images available at the photoreceptors, animals must identify those objects or scenes that have some value to them. This problem, however, can be very tricky since the image data (e.g., photoreceptor signals) may underdetermine the scene data (e.g., surface reflectances) needed to find and identify objects (Kersten 1990). In the case of very primitive organisms crude special-purpose filters may suffice, such as the "fly detector" in frogs. But for more general object detection and for the reconstruction of physical scenes from noisy image data, some additional clues or constraints are needed. One type of clue is knowledge of the statistical properties of scenes and images (Attneave 1954; Barlow 1961, 1989). Such information can be used to recover physical scene data from noisy image data, as shown for example by Geman and Geman (1984).
Barlow (1989) has also argued that such information is necessary for object recognition, since it allows objects to be discriminated from irrelevant background data. Also, since objects are encoded redundantly in sensory messages, knowing this redundancy can aid in their recognition.

But how can an organism go about learning the statistical properties of sensory messages? And second, what is the most efficient way of storing this statistical knowledge? The enormity of these problems becomes obvious when one considers just how many numbers in principle must be learned and stored. In vision this amounts to storing the probability of every possible set of pixel values in both space and time. For a conservative estimate of this number for humans, assume one million cones sampled in space only (temporal sampling would add considerably to this). Then assume a grey scale of roughly 100, which is less than the number of contrast units that can be discriminated in bright light, and ignores luminance data. This gives 100^1,000,000 possible images, whose probabilities could not possibly be stored as 100^1,000,000 numbers in the brain, which has no more than 10^16 synapses. However, there are two very important properties of images that allow this number to be decreased enormously. The first and most obvious is noise: most images differ from each other only by noise or by symmetries, so there is no need to learn and store their individual probabilities. The second simplifying property is that sensory message probabilities can often be derived from a far smaller set of numbers. This is the case when the set of probabilities P(I) for the images I = {I_1, I_2, I_3, ..., I_n}, with pixel values I_i, can be factorized into a far smaller set of statistically independent probabilities for the subimages {I′_1, I′_2, I′_3, ..., I′_m} as P(I) = P(I′_1)P(I′_2)P(I′_3)···P(I′_m). Thus, as Barlow (1989) has emphasized, the most efficient way to store the probabilities P(I) would be to find a transformation, a factorial code, from the pixel representation I_i to the statistically independent representation I′_i with smallest m. It can be demonstrated (Atick and Redlich 1990a) that this explains one purpose of the retinal transfer function, which approximately removes (second-order) statistical dependence from the optic nerve outputs {I′_1, I′_2, I′_3, ..., I′_m}.

Neural Computation 5, 289-304 (1993) © 1993 Massachusetts Institute of Technology
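The storage savings of a factorial code can be made concrete in a small numerical sketch; the 8-pixel image and its assumed factorization into four independent 2-pixel features are illustrative, not from the paper.

```python
import itertools
import random

random.seed(0)

# Toy image: 8 binary pixels = 4 independent 2-pixel "features".
# Each feature has its own distribution over its 4 possible values.
n_features, feat_size = 4, 2
feature_dists = []
for _ in range(n_features):
    w = [random.random() for _ in range(2 ** feat_size)]
    s = sum(w)
    feature_dists.append([v / s for v in w])

def P_joint(image):
    """P(I) factorizes over features: P(I) = prod_k P_k(feature_k)."""
    p = 1.0
    for k in range(n_features):
        bits = image[k * feat_size:(k + 1) * feat_size]
        idx = int("".join(map(str, bits)), 2)
        p *= feature_dists[k][idx]
    return p

# Storage: full joint table vs factorial code.
full_table = 2 ** (n_features * feat_size)   # 256 numbers
factorial = n_features * 2 ** feat_size      # 16 numbers

# Sanity check: the factorized probabilities still sum to 1 over all images.
total = sum(P_joint(img) for img in itertools.product((0, 1), repeat=8))
print(full_table, factorial, round(total, 6))
```

Even at this tiny scale the factorial code needs 16 numbers instead of 256; for the 10^6-cone estimate in the text the ratio becomes astronomical.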
Finding a transformation to a factorial representation is an unsupervised learning problem that typically requires many learning stages. At each stage I assume that only the local probabilities P(I′_i) are measured, but as statistical independence is increased, products of these give better approximations to the joint probabilities P(I). To quantify just how statistically independent a representation is at each stage, it is necessary to define a global learning measure L, which should be a function only of the local probabilities P(I′_i) (global denotes a property of the entire representation at each stage). Such a measure is defined here based on the redundancy,¹ a quantity that is minimal only when the code is factorial. Learning a redundancy reducing transformation at each stage can be very difficult and may depend on the nature of the redundancy at that stage. In the retina, the greatest source of redundancy is due to

¹I use "redundancy reduction" to refer to reducing statistical dependence between pixels. This is not strictly speaking the only source of redundancy, which also can come from the uneven probability distribution of grayscale values. Nevertheless, I use the term "redundancy reduction" because it implies an information preserving transformation (unlike, e.g., "entropy reduction") and also because the word "redundancy" has an intuitive appeal. The precise meaning of redundancy reduction here is a transformation which increases the learning measure L, to be defined in Section 2.
multiscale second-order correlations between pixels, corresponding to scenes being very smooth over large spatial and temporal regions, and this redundancy can be removed easily through a linear filter (Atick and Redlich 1991). But this is the exception, since in general quite complicated nonlinear coding is required (for some progress see, e.g., Barlow and Foldiak 1989; Redlich 1992). However, there is one common type of nonlinear redundancy reduction that is relatively straightforward to learn. This is the redundancy in images coming from strong correlations within sharply delineated features, which are in turn weakly correlated with each other (the features can be spatially extended as long as they decouple from other parts of the image). The procedure for factorizing in this case is to look first for the subfeatures that are most tightly bound, and therefore are responsible for the most redundancy. These may then be pieced together in stages, until eventually a statistically independent set is found. What makes this much simpler than I expected is the existence of a completely local measure of how much an individual feature (or subfeature) contributes to the global redundancy. By local I mean that this measure is only a function of the probabilities of the feature and its components, but not of the probabilities of any other feature or components. The locality of this feature measure also allows simple implementation of redundancy reduction through unsupervised "neural" learning algorithms. One such non-Hebbian algorithm will be discussed here, and compared to some other unsupervised algorithms (von der Malsburg 1973; Bienenstock et al. 1982; Hinton and Pearlmutter 1986). The closest connection is with Hinton and Pearlmutter's algorithm, because their single-unit feature measure is mathematically related to the one here, though this is manifest only in a particular approximation.
This connection is not surprising since (see, e.g., Hinton and Sejnowski 1983) their aim was also to learn statistical regularities. Some of the major distinctions between this work and theirs are the focus here on efficiency of storage and learning (on statistical independence) and also the insistence here on transformations which preserve information (see Section 6). To demonstrate the power of the present approach, I apply it to strip the redundancy from English text, that is, to learn the text's statistical properties. This example is used because we all know a fairly good solution to the problem: transform from the letter representation to the word representation. Of course, to make the problem sufficiently difficult, all clues to the solution, such as spaces between words, punctuation, and capitalization, are first eliminated from the text. The algorithm eventually segments the text just as desired into words and tightly bound groups of words. Although it is not the main purpose of this paper, I shall also indicate how useful the algorithm can be for recovering messages from noisy signals. This works best when the useful information is coded redundantly while the noise is random. It then turns out that the algorithm used here
A. Norman Redlich
finds only the useful portion of the input, and this will be demonstrated using noisy text. Finally, I should emphasize that my aim is not to find redundancy in language or to claim that words are learned or stored in the brain as found here. Instead, my ultimate motivation is to find an environmentally driven, self-organizing principle for the processing of visual images (and other sensory signals) to facilitate object or pattern identification (see Redlich 1992). So by "words" here I always wish to imply visual features, with letter positions in a text corresponding to pixel locations in an image and particular letters corresponding to image grayscale (or color) values. The next step of applying the algorithms derived here to visual images will appear in future papers.
2 Global Learning Measure L
Taking the English text example, the input consists of an undifferentiated stream of "pixel" values L = {a, b, c, ...} (as in Fig. 1a), and the goal is to learn the probability functions P(l) = P(l_1, l_2, l_3, ..., l_n), with the subscript n denoting the position of letter l_n in the text. In practice, the aim is to learn P(l) for string length n roughly equal to the correlation length of the system. But even for n as small as 12 this in principle requires storing and updating as many as 26^12 numbers. To find P(l) more efficiently, at each stage letters will be grouped together into "words," which at first will be only pieces of real English words. Then at successive stages the new set of words W = {w_1, w_2, w_3, ..., w_m} will be built by combining the previous "words" into larger ones, among which will be real English words and also tightly correlated groups of real words (from now on quotes around words are dropped). At the very first stage P(l) = P(l_1, l_2, l_3, ..., l_n) is very poorly approximated by the product of letter probabilities P(l_1)P(l_2)P(l_3)...P(l_n), but as the redundancy is reduced, products of word probabilities P(w_1)P(w_2)P(w_3)...P(w_m) give better and better approximations to P(l). To quantitatively measure how well P(l) is known at each stage, we can use information theory (see also Barlow 1961) to define a global learning measure

$$L = 1 - \frac{H_W/\bar{S}}{H_L} \qquad (2.1)$$

where H_L is the entropy in the original letter code,
$$H_L = -\sum_{l \in L} P_l \log(P_l) \qquad (2.2)$$
and H_W/S̄ is the word entropy per letter at a particular stage in learning:

$$H_W = -\sum_{w \in W} P_w \log(P_w), \qquad \bar{S} = \sum_{w \in W} P_w s_w \qquad (2.3)$$

with S̄ the average of the word lengths s_w.
[Figure 1, panels (a)-(c): the sample text printed at successive segmentation stages, with spaces marking the current word boundaries.]
Figure 1: A small sample of the text with all clues to the redundancy removed. In (a) single letters are treated as words, indicated by the spaces between them, and the entropy is H_L = 4.16 bits. As the redundancy is reduced, letters are combined into words, indicated by removing the spaces between them. Only some of the redundancy reduction stages are shown in (b)-(f), with the entropy per letter reduced to H_W/S̄ = 3.75, 3.46, 2.84, 2.51, and 2.35 bits, respectively (the real word entropy per letter is H_W/S̄ = 2.17 bits). Continued.
[Figure 1, panels (d)-(f): the same text sample at three later segmentation stages.]
Figure 1: Continued.
Initially, the set of words W is also the set of letters L, so H_W/S̄ = H_L and L = 0, indicating that no learning has occurred. At the other extreme the word code is factorial, for which² H_W/S̄ = H, where H is the total entropy per letter of the text:

$$H = \lim_{n \to \infty} \frac{1}{n} H(l_1, l_2, \ldots, l_n) \qquad (2.4)$$
[It is well known (Shannon and Weaver 1949) that H_W/S̄ ≥ H, with equality only when the words w in W become completely independent.] So
²This is true because the word code is reversible, so H is invariant; for another type of reversible code see Redlich (1992).
the learning measure L starts out equal to zero and grows as redundancy is reduced until it approaches its maximum value

$$L_{\max} = R_c = 1 - \frac{H}{H_L}$$

where R_c is the total redundancy in the text due to correlations between letters. If there are no correlations between letters, then H = H_L and R_c = 0. It is important to note that although L is bounded from above by R_c (H_W/S̄ is bounded from below by H), it can go negative, so the system can in effect unlearn, or increase redundancy. This happens when words (or letters) at one stage which are already independent of each other are mistakenly combined into new words.

3 Local Feature Measure F
Now that we have a global learning measure L, how do we go about finding the word/letter combinations W → W' that increase L? For this purpose it is useful to have a local measure of how much an individual new word or feature increases L. Such a local feature measure F can be derived directly from L by calculating the change in L caused by including in W → W' a single new feature. Actually, since increasing L corresponds to decreasing H_W/S̄, we need to calculate the change in H_W/S̄. For extra clarity, let us first calculate the change in H_W/S̄ when only two words in W are combined to form a new word. Assume for simplicity that the current word set W still contains many single letters, including the letters "i" and "n." Let us see, as an example, how combining these letters into the particular word w = "in" changes H_W/S̄ in 2.3. Following 2.3,

$$\frac{H'_W}{\bar{S}'} = \frac{1}{\bar{S}'}\left[-P'_{in}\log P'_{in} - P'_i\log P'_i - P'_n\log P'_n - \sum_{w \in W \setminus \{i,n\}} P'_w\log P'_w\right] \qquad (3.1)$$

where P_in, P_i, and P_n denote the probabilities of the example word "in" and of the letters "i" and "n," with primes marking the new code. The "in," "i," and "n" terms have been separated out, so the sum runs over the remainder of the old set W (assuming "i" and/or "n" still exist as independent elements in the set W'). To calculate the change H'_W/S̄' − H_W/S̄, the new probabilities P' must be expressed in terms of the old probabilities P. This is easily accomplished using P_w = N_w/N, where N_w = number of times word w appears, and N = total word count for the text (later N can be taken to infinity). After combining "i" and "n" into "in," the number N → N' = N − N_in, since every time the word "in" occurs it is counted as one word in W', but was
counted as two words in W. Likewise, N_i → N_i − N_in and N_n → N_n − N_in. Therefore,

$$P'_{in} = \frac{P_{in}}{1 - P_{in}}, \qquad P'_i = \frac{P_i - P_{in}}{1 - P_{in}}, \qquad P'_n = \frac{P_n - P_{in}}{1 - P_{in}}, \qquad P'_w = \frac{P_w}{1 - P_{in}} \qquad (3.2)$$
Substituting these P' into 3.1, and using the new average word length

$$\bar{S}' = \frac{\bar{S}}{1 - P_{in}} \qquad (3.3)$$
so that

$$\frac{H'_W}{\bar{S}'} - \frac{H_W}{\bar{S}} = -\frac{F}{\bar{S}} \qquad (3.4)$$

which defines the feature measure F:

$$F = P_{in}\log P_{in} + (P_i - P_{in})\log(P_i - P_{in}) - P_i\log P_i + (P_n - P_{in})\log(P_n - P_{in}) - P_n\log P_n - (1 - P_{in})\log(1 - P_{in}) \qquad (3.5)$$

The original average word length S̄ has not been included in the definition of F because it is the same for all new features built out of W. As promised, this feature measure depends only on the local data P_in, P_i, and P_n. The local measure F can also be derived in general for new words of any length. We need to take into account the number m of old words making up the new word, as well as the number of times, m_w, each old word w appears. For example, if the new word "trees" is built out of the old words "tr," "e," and "s" in W, then m = 4, while m_tr = 1, m_e = 2, m_s = 1, giving Σ_w m_w = m. The new probabilities P' can then be derived from P_w and m_w using counting arguments only slightly more complicated than before. Thus, denoting the new word by f, for feature,
N → N' = N − (m − 1)N_f, while N_w → N_w − m_w N_f for w in the set W_f of old words in f, and N_w → N_w otherwise. With these adjustments, the general feature measure defined by 3.4 is

$$F = F(P_f, P_w;\ w \in W_f) = P_f\log P_f + \sum_{w \in W_f}\left[(P_w - m_w P_f)\log(P_w - m_w P_f) - P_w\log P_w\right] - \left[1 - (m - 1)P_f\right]\log\left[1 - (m - 1)P_f\right] \qquad (3.6)$$
This reduces to 3.5 in the special case f = "in," W_f = {"i," "n"}, m = 2, m_i = 1, m_n = 1. To gain some intuition into just what statistical properties are measured by F, it is useful to approximate F for the case P_f ≪ P_w for all w ∈ W_f. In the case of English text, this is a good approximation in the first stages of redundancy reduction. The approximate F is

$$F \approx P_f\left(\log\frac{P_f}{\prod_{w \in W_f} P_w} - 1\right) \qquad (3.7)$$
using log base e (each old word w enters the product m_w times). In this form, F bears its closest resemblance to the single-unit G-maximization measure of Hinton and Pearlmutter (1986). This is because the first term in 3.7, that is, neglecting −P_f, is like a single term in the Kullback information when the original probability estimate is statistical independence of inputs. The actual Kullback information requires summing such terms over all features f. For the feature f to significantly reduce redundancy, F needs to be strongly positive, and for this we see from 3.7 that the feature must have two statistical properties: First, the term in parentheses must be large and positive, which requires P_f ≫ Π P_w. This term plus one is the mutual information (Shannon and Weaver 1949), which measures how strongly the components that make up the feature³ are correlated. A good example in English is "qu," which has high mutual information because "q" always predicts "u," so the "u" following "q" is redundant. The second requirement is that the feature be relatively frequent, since P_f multiplies the mutual information. Otherwise, the feature could be highly self-correlated, but not common enough to significantly reduce the global redundancy. This is very important, since the mutual information alone tends to favor very rare features composed of very rare elements. On the other hand, large P_f alone is a dangerous criterion, since there are many common features with small or even negative mutual information. Including these in the new set W' actually increases the redundancy, since it effectively creates a correlated structure out of already statistically independent elements. In English text an example of a redundancy-increasing feature is "tte," built out of "t" and "e."

³In physics language, the feature is analogous to a bound state like an atom built out of protons and electrons. The mutual information is then proportional to the difference between the bound state (feature) energy and the sum of the energies of its components. This is the amount of energy that is gained by building the bound state (atom).

4 Experimental Results
F can be applied to devise a redundancy reduction algorithm for English text. One simple strategy would be to find at each stage the single new word which has the largest F, and thus find W' from W. However, in practice it turns out that a far more time-efficient approach is to find the set of new words with largest F, say 10 to 100 new words at each step. Another computational efficiency is gained by limiting the number m of component words to some small number, such as three or four. It turns out that for English text, and likely also for many other ensembles, using only up to third-order correlations at each stage (m = 2, 3) is sufficient, since larger words are most often composed of redundant subwords. To experimentally test how well this works, I applied it to learning about 25 pages of a well-known children's book (Carroll 1865), chosen for its moderately sized vocabulary of roughly 1700 (real English) words. After eliminating all punctuation, capitals, and word spaces, the excerpt contained approximately 48,000 characters. The letter entropy was found to be H_L = 4.16 bits, while the entropy per letter for real English words is 2.17 bits. Figure 1a shows a small piece of the text after it was stripped of any redundancy clues. Spaces are used between letters to indicate that they are being treated here as separate "words." Figure 1b-f then show the text sample in various stages of redundancy reduction. At each stage, when new words are built, the spaces between their component words are eliminated. Figure 1 shows the results of using only second- and third-order joint probabilities at each step to find roughly 10 to 100 new words per stage. About 20 such stages were required to get the redundancy down to the H_W/S̄ = 2.35 bits of Figure 1f (only 5 of the 20 stages are actually shown in the figure), which is close to the real word entropy.
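One greedy stage of this procedure can be sketched as follows, restricted to adjacent pairs (m = 2) and scored with the approximation of equation 3.7. The stopping rule (no pair with F > 0), the guard against one-off pairs, and the toy stream are my illustrative choices, not the paper's settings.

```python
# Sketch of the greedy stage-by-stage merging of Section 4 (pairs only).
from collections import Counter
from math import log

def best_pair(words):
    """Adjacent pair with the largest approximate F (eq. 3.7), or None."""
    n = len(words)
    cnt = Counter(words)
    best, best_F = None, 0.0
    for (a, b), c in Counter(zip(words, words[1:])).items():
        if c < 2:                 # illustrative guard: skip one-off pairs
            continue
        # F ~ P_f (log(P_f / (P_a P_b)) - 1) with P_f = c/n, P_x = cnt[x]/n
        F = (c / n) * (log(c * n / (cnt[a] * cnt[b])) - 1)
        if F > best_F:
            best, best_F = (a, b), F
    return best

def merge(words, pair):
    """Replace every non-overlapping occurrence of `pair` by its concatenation."""
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) == pair:
            out.append(words[i] + words[i + 1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out

words = list("theathebthectheathebthec")   # toy stream: thea theb thec ...
while (pair := best_pair(words)) is not None:
    words = merge(words, pair)
# The stream segments at the "word" boundaries: the|a|the|b|the|c ...
```

On this toy stream the loop merges "t"+"h", then "th"+"e", and then stops, since every remaining pair has negative F, recovering the word "the" without any boundary clues.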
Even computing all second- and third-order joint probabilities, these results represent only a few hours computation on a Macintosh computer. But the computation time and array storage needed can be reduced even further by calculating the joint probabilities only for a sample of possible new words, as will be discussed in the next section. Figure 2 shows the improvement possible using up to fourth-order probabilities; only the last stage is shown in the figure. Since there is only a small improvement over the third-order result, this demonstrates that fourth-order is not absolutely necessary.
[Figure 2: the text sample after the final stage using up to fourth-order correlations.]
Figure 2: The same sample of text, but using up to fourth-order correlations per stage instead of the third-order limit in Figure 1. Only the last stage is shown. It has H_W/S̄ = 2.28 bits.

Reviewing the results in Figure 1, one may note that some real English words, such as "daisies," are not found, but this is due to the relatively small sample of English text used. In fact, the word "daisies" appears in the text only once, so it would have been an error for it to qualify as a redundant feature. However, the algorithm is superbly sensitive to redundant words which appear in the text as few as two or three times. Another thing to observe is that many groups of real words are combined into single features. Some of this reflects actual redundancy in English; for example, "ofthe" is likely a truly redundant combination in English, but many of these, such as "whiterabbit," are only redundant for this sample text. Such real word groupings would have far lower redundancy (lower F) in a much larger sample text which includes many different subjects and writing styles. The most significant success of the redundancy reduction algorithm is the segmentation of the text, which is almost always broken at the boundary between real words. This efficient segmentation corresponds to finding a cover W (Fig. 1f) of the entire sample with a small number of words, less than the number of real words. This is close to the smallest number of (approximately) statistically independent words. Such efficient segmentation would not have been found using an algorithm that chooses only high-probability words.
5 Neural Implementation
In a "neural" implementation of the algorithm used in Section 4, neurons, or dendrites, calculate the local data P_f and P_w. Actually, only P_f needs to be calculated, since the P_w are computed by the previous stage and may be encoded in the neural output strengths. Finding F(P_f, P_w) still requires some computation, but (especially in 3.7) this reduces essentially
to computing logarithms! The real problem is not how to calculate P_f or F(P_f, P_w), but how to search the space of possible features for those with largest F. One option is to convert this search to one over a set of (continuous) synaptic weights and then apply gradient descent to maximize F. This is the technique used by Hinton and Pearlmutter (1986) to maximize the Kullback information. Though its application to F is somewhat different, I believe it might work, although I have not attempted it. Instead, I wish to explore here a more direct approach which avoids the convergence problems often associated with gradient descent. The simplest and most direct approach would be to exhaustively calculate P_f for all features of size ≤ m. Of course, an m small enough to make this computationally feasible might be too small to discover the redundancy. But there really is no need for an exhaustive search, since a prerequisite for large F is large P_f, and a more limited sampling will usually find these common features. Then only those common features with sufficiently large F need be kept. I now use this to develop a temporal search algorithm. Suppose first that there are a fixed number (smaller than needed for an exhaustive search) of feature neurons at each learning stage, which can be in one of two states, occupied or free. Occupied neurons respond to one feature, and their job is to quickly calculate a good approximation for F. As soon as the occupied neuron discovers that F is below some constant threshold F*, it becomes free and is available to test another feature. The neurons are mutually inhibiting, so no two neurons can be occupied by the same feature. Also there is some ordering to decide which free neuron takes the next possible feature. To approximate F, a neuron only needs an approximation for P_f, since the P_w were calculated by the previous stage. How big P_f needs to be for F(P_f, P_w) > F* depends on the probabilities of the input elements P_w that make up the feature.
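The occupied/free pool can be sketched as follows; the pool size, the threshold F*, the input probabilities, and the synthetic stream of candidate pairs are all hypothetical choices for illustration. The running estimate P_f ≈ n/t follows the simple counting rule the text describes next, and F is scored with the approximation 3.7.

```python
# Sketch of the Section 5 temporal search with a fixed pool of feature neurons.
from itertools import permutations
from math import log

P_w = {**{c: 0.09 for c in "abcdefghij"}, "q": 0.05, "u": 0.05}  # made-up inputs

def run_pool(stream, P_w, n_neurons=4, F_star=0.01):
    """`stream` yields candidate pairs; each occupied neuron tracks one pair."""
    occupied = {}                        # pair -> (occurrence count n, first-seen time)
    for t, pair in enumerate(stream, start=1):
        if pair in occupied:
            n0, t0 = occupied[pair]
            occupied[pair] = (n0 + 1, t0)
        elif len(occupied) < n_neurons:  # a free neuron picks up the new feature
            occupied[pair] = (1, t)
        for p, (n0, t0) in list(occupied.items()):
            P_f = n0 / (t - t0 + 1)      # running estimate P_f(t) ~ n/t
            F = P_f * (log(P_f / (P_w[p[0]] * P_w[p[1]])) - 1)   # eq. 3.7
            if F < F_star:
                del occupied[p]          # estimate fell below F*: neuron freed
    return set(occupied)

rare = iter(permutations("abcdefghij", 2))    # 90 distinct one-off pairs
stream = [("q", "u") if i % 2 == 0 else next(rare) for i in range(180)]
features = run_pool(stream, P_w)
# The recurring pair ("q", "u") keeps a high running P_f and is never freed,
# while one-off pairs are eventually dropped and their neurons recycled.
```

The point of the sketch is the memory economy: only a handful of candidate counts are held at any time, yet frequent features survive because a dropped frequent feature is soon picked up again.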
In effect, the feature neuron uses a feature-dependent threshold A(P_w) for P_f. (If the criterion were simple frequency of the feature, on the other hand, one would use a fixed threshold A for P_f.) Features that are built out of infrequent inputs w have a lower threshold for P_f, as can be seen most easily in 3.7. The final ingredient is an approximation for P_f(t) at time t, where t = 0 is the time when the neuron first picks up its feature. For this, I make a very simple choice: If the feature has occurred only once, at time t = 0, then for t > 0 approximate P_f(t) = 1/t; if the feature occurs a second time at t = T_1, use for t > T_1, P_f(t) = 2/t; and if the feature has

⁴It should be noted that −log(P) has a very nice interpretation as the information in, or improbability of, the signal. If neurons have output strengths proportional to the information they carry, then the mutual information, one of the ingredients needed for F, can be calculated through simple addition of neuronal outputs. This was suggested by Uttley (1979) as one of the attractions of using the mutual information to build a conditional probability computer (Singh 1966). Also, the idea that neurons signal improbability has been proposed by Barlow (1989), and there is evidence for this in the retina.
occurred n times, use P_f(t) = n/t, which eventually approaches the true P_f for large n. If at any time P_f(t) drops below the threshold A(P_w), that is, F(t) drops below F*, then the occupied neuron is freed to search for other features. Of course, since for small t P_f(t) may be a poor approximation, good features will occasionally be dropped, but these are likely to be picked back up again since they must be relatively frequent. On the other hand, the longer a neuron is occupied by a feature, the better the approximation P_f(t) becomes, and the less susceptible it is to such errors. In fact, I have simulated this algorithm for the beginning stages of learning on the sample text used in Section 4, and it finds exactly the same set of features as does an exhaustive search, but it requires far less memory. One may also ask how this learning algorithm compares with other unsupervised "feature" detection algorithms. First, as has been discussed, this approach is related to Hinton and Pearlmutter's: both favor features with large P_f and with P_f ≫ Π P_w, although theirs is not guaranteed to find a factorial code. The greater distinction is between algorithms that use these criteria and algorithms of the type proposed by von der Malsburg (1973) and by Bienenstock et al. (1982). Those also favor features with large P_f, but they prefer features composed of elements with large P_w. This may lead to features with small mutual information, and thus may include false background elements. For words in text this leads to poor segmentation, since many very tightly bound words are composed of relatively rare subwords.

6 Noise and Generalization
As mentioned in the introduction, desirable input information is often encoded redundantly (e.g., words in text), so redundancy can be used to distinguish true signal from noise. This is the case, for example, when the noise is not correlated with the true signal or with itself. Then the feature detection algorithm still finds the true signal redundancy, the true signal statistics, even though the total signal is noisy. To show this, consider an English text with random noise, that is, a certain fraction of letters, chosen randomly, are incorrect. Taking the same sample text used in Section 4, but with 1/13 of the letters randomly incorrect, I applied the same algorithm as before. The result, shown in Figure 3, is that only real words and word combinations are chosen by the algorithm, while noisy letters are ignored. So noise does not confuse the feature detection. Once the features have been found, the text can be restored by using the probabilities of the redundant words to predict the noise-corrupted letters, that is, to build Bayesian filters. It should be noted that in order to reconstruct the true text, one needs to know more than just the statistical properties of the noisy input messages. In the above example, one additionally needs to know that the
[Figure 3, panels (a) and (b): the text sample with 1/13 of its letters randomized, before redundancy reduction and at a later stage.]
Figure 3: Again, the same sample of text as in Figure 1, but with one out of 13 letters randomly incorrect. The noisy text before any redundancy reduction is shown in (a); it has H_L = 4.26 bits, slightly higher than the original text because it is less correlated. One of the later stages in redundancy reduction is shown in (b); it has entropy per letter H_W/S̄ = 2.99 bits. Note that the noise does not confuse the algorithm into finding false words or word combinations.

noise is random. In other words, one needs at least some outside knowledge or supervision. For example, mean squared filtering, which uses the autocorrelator of an ensemble to filter out noise, can be implemented through a supervised perceptron-type algorithm (see Atick and Redlich 1990b). This leads to an important point: purely unsupervised learning, based strictly on statistics, does not lead to conceptualization. This is due to the implicit assumption that every distinguishable input state potentially carries a different message. In conceptualizing, on the other hand, different input states which carry the same useful message are grouped together. This grouping requires some further knowledge that distinguishes signal from noise, or provides a measure of closeness on the signal space (Kohonen 1984), or provides active supervision as in perceptron learning. Also, the information that distinguishes between different members of a concept can be thrown away, as in noise filtering. Since this information reduction effectively lowers the number of input states, it also simplifies the problem of learning and storing statistics. So one challenge is to incorporate in the present redundancy reduction strategy a controlled or supervised information reduction. Some first steps in this direction have been taken by Linsker (1989) and by Atick and Redlich (1990a), both using the mutual information between the desired scene data and the noisy image signal (for a different application of redundancy reduction to supervised learning, see Redlich 1992).
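The Bayesian-filter idea mentioned above can be illustrated with a toy sketch; the vocabulary, the word probabilities, and the single-substitution noise model here are all hypothetical. Once the redundant words and their probabilities are known, a corrupted string is restored by choosing the substitution that yields the most probable known word.

```python
# Toy illustration of restoring a noise-corrupted word from learned
# word probabilities (hypothetical vocabulary and noise model).
P_word = {"the": 0.5, "cat": 0.25, "hat": 0.25}   # learned word probabilities

def restore(noisy, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Try every single-letter substitution; keep the most probable known word."""
    best, best_p = noisy, 0.0
    for i in range(len(noisy)):
        for c in alphabet:
            cand = noisy[:i] + c + noisy[i + 1:]
            if P_word.get(cand, 0.0) > best_p:
                best, best_p = cand, P_word[cand]
    return best
```

For example, `restore("tge")` yields `"the"`, while an uncorrupted `"cat"` is left unchanged; with a model of the noise rate this maximum-probability choice becomes a proper Bayesian posterior decision.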
Acknowledgments

I thank J. Atick for his very perceptive comments on the manuscript. Also, this work was supported in part by a grant from the Seaver Institute and in part by DOE Grant DE-FG02-90ER40542.
References

Atick, J. J., and Redlich, A. N. 1990a. Towards a theory of early visual processing. Neural Comp. 2, 308-320.
Atick, J. J., and Redlich, A. N. 1990b. Predicting ganglion and simple cell receptive field organizations. Int. J. Neural Syst. 1, 305.
Atick, J. J., and Redlich, A. N. 1991. Convergent algorithm for sensory receptive field development. Neural Comp. In press.
Atick, J. J., and Redlich, A. N. 1992. What does the retina know about natural scenes? Neural Comp. 4, 196-210.
Attneave, F. 1954. Some informational aspects of visual perception. Psychol. Rev. 61, 183-193.
Barlow, H. B. 1961. Possible principles underlying the transformation of sensory messages. In Sensory Communication, W. A. Rosenblith, ed. MIT Press, Cambridge, MA.
Barlow, H. B. 1989. Unsupervised learning. Neural Comp. 1, 295-311.
Barlow, H. B., and Foldiak, P. 1989. In The Computing Neuron. Addison-Wesley, New York.
Bienenstock, E. L., Cooper, L. N., and Munro, P. W. 1982. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32-48.
Carroll, L. 1865. Alice in Wonderland. Castle, Secaucus.
Eriksson, K., Lindgren, K., and Mansson, B. A. 1987. Structure, Context, Complexity, Organization, Chap. 4. World Scientific, Singapore.
Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Patt. Anal. Machine Intell. PAMI-6, 721-741.
Hinton, G. E., and Pearlmutter, B. A. 1986. G-maximization: An unsupervised learning procedure for discovering regularities. In Neural Networks for Computing, AIP Conference Proceedings, Snowbird, UT, J. S. Denker, ed. AIP Press, New York.
Hinton, G. E., and Sejnowski, T. J. 1983. Optimal perceptual inference. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 448-453.
304
A. Norman Redlich
Kersten, D., 1990. Statistical limits to image understanding. In Vision: Coding and Eficiency, C. Blakemore, ed. Cambridge University Press, Cambridge. Kohonen, T., 1984. Self Organization and Associative Memory, Springer-Verlag, Berlin. Kullback, S., 1959. Information Theory and Statistics. Wiley, New York. Linsker, R. 1989. An application of the principle of maximum information preservation to linear systems. In Advances in Neural Information Processing Systems, D. S. Touretzky, ed., Vol. 1, pp. 186-194. Morgan Kaufmann, San Mateo, CA. Redlich, A. N. 1992. Supervised factorial learning. Preprint. Shannon, C. E., and Weaver, W. 1949. The Mathematical Theory of Communication. The University of Illinois Press, Urbana. Singh, J., 1966. Great Ideas in lnformation Theory, Languageand Cybernetics, Chap. 16, Dover, New York. Uttley, A. M., 1979. Information Transmission in the Nervous System. Academic Press, London. von der Malsburg, C. 1973. Self-organization of orientation sensitive cells in the striate cortex. Kybernetik 14, 85-100. Received 12 December 1991; accepted 29 September 1992.
Communicated by Halbert White
Approximation and Radial-Basis-Function Networks

Jooyoung Park
Irwin W. Sandberg
Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712 USA
This paper concerns conditions for the approximation of functions in certain general spaces using radial-basis-function networks. It has been shown in recent papers that certain classes of radial-basis-function networks are broad enough for universal approximation. In this paper these results are considerably extended and sharpened.
1 Introduction
This paper concerns the approximation capabilities of radial-basis-function (RBF) networks. It has been shown in recent papers that certain classes of RBF networks are broad enough for universal approximation (Park and Sandberg 1991; Cybenko 1989). In this paper these results are considerably extended and sharpened.

Throughout this paper we use the following definitions and notation, in which N and R denote the natural numbers and the set of real numbers, respectively, and, for any positive integer r, R^r denotes the normed linear space of real r-vectors with norm || · ||. (·, ·) denotes the standard inner product in R^r. L^p(R^r), L^∞(R^r), and C_c(R^r), respectively, denote the usual spaces of R-valued maps f defined on R^r such that f is pth-power integrable, essentially bounded, and continuous with compact support. With W ⊂ R^r, C(W) denotes the space of continuous R-valued maps defined on W. The usual L^p and uniform norms are denoted by || · ||_p and || · ||_∞, respectively. The characteristic function of a Lebesgue measurable subset A of R^r is denoted by 1_A. The convolution operation is denoted by "*", and the Fourier transform (Stein and Weiss 1971) of a Fourier-transformable function f is written as f̂.

By a cone in R^r we mean a set C ⊂ R^r such that x ∈ C implies that ax ∈ C for all a ≥ 0. By a proper cone we mean a cone that is neither empty nor the singleton {0}.

The block diagram of a typical RBF network with one hidden layer is shown in Figure 1. Each unit in the hidden layer of this RBF network has its own centroid, and for each input x = (x_1, x_2, ..., x_r), it computes the distance between x and its centroid. Its output (the output signal at one
Neural Computation 5, 305-316 (1993) © 1993 Massachusetts Institute of Technology
Figure 1: A radial-basis-function network.
of the kernel nodes) is some nonlinear function of that distance. Thus, each kernel node in the RBF network computes an output that depends on a radially symmetric function, and usually the strongest output is obtained when the input is at the centroid of the node. Each output node gives a weighted summation of the outputs of the kernel nodes. We first consider RBF networks represented by functions q : R^r → R of the form
q(x) = Σ_{i=1}^{M} w_i K((x − z_i)/σ)

where M ∈ N is the number of kernel nodes in the hidden layer, w_i ∈ R is the weight from the ith kernel node to the output node, x is the input vector (an element of R^r), and K is the common radially symmetric kernel function of the units in the hidden layer. Here z_i ∈ R^r and σ > 0 are the centroid and smoothing factor (or width) of the ith kernel node, respectively. We call this family S_0(K). Note that the networks in this family have the same positive smoothing factor in each kernel node.

Families with a translation-invariant vector space structure are also often important. For example, networks are widely used in which the smoothing factors are positive real numbers as in S_0(K), but can have different values across kernel nodes. This family is the smallest vector space among those containing S_0(K) as a subset. We call this vector space
S_1(K). Its general element q : R^r → R is represented by

q(x) = Σ_{i=1}^{M} w_i K((x − z_i)/σ_i)
where M ∈ N, σ_i > 0, w_i ∈ R, and z_i ∈ R^r for i = 1, 2, ..., M.

For the sake of clarity and convenience, we consider only a one-dimensional output space instead of outputs represented by multiple nodes as in Figure 1. The extension of our results to multidimensional output spaces is trivial. Notice that the kernel function K characterizes the families S_0(K) and S_1(K), and that each kernel node has its output derived from K indexed by two parameters (the centroid and smoothing factor), one for position and the other for scale. Ordinarily K is radially symmetric with respect to the norm || · || in the sense that ||x|| = ||y|| implies K(x) = K(y). However, as we shall see in the next section, radial symmetry of the kernel function K : R^r → R is needed in the development of only one of the approximation results in this study. Except where indicated to the contrary, radial symmetry of the kernel function K is not assumed.

In Park and Sandberg (1991) it is shown that S_0(K) is dense in L^p(R^r), p ∈ [1, ∞), if K is an integrable bounded function such that K is continuous almost everywhere and ∫_{R^r} K(x) dx ≠ 0. In Cybenko (1989) it is pointed out that a consequence of a generalization of a theorem due to Wiener is that the elements of a vector space related to S_1(K) are capable of approximating functions in L^1(R^r). The purpose of this paper is to report on a substantial sharpening of the results in Park and Sandberg (1991) and Cybenko (1989).
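As a concrete illustration (not part of the original analysis), an element of S_1(K) can be evaluated directly from the definition above; the Gaussian kernel and all parameter values below are arbitrary choices for the sketch:

```python
import numpy as np

def rbf_net(x, weights, centroids, widths, kernel):
    """Evaluate q(x) = sum_i w_i * K((x - z_i) / sigma_i), an element of S_1(K).

    x:         input vector in R^r
    weights:   (M,) array of output weights w_i
    centroids: (M, r) array of centroids z_i
    widths:    (M,) array of positive smoothing factors sigma_i
    kernel:    kernel function K mapping R^r -> R
    """
    return sum(
        w * kernel((x - z) / s)
        for w, z, s in zip(weights, centroids, widths)
    )

# Radially symmetric Gaussian kernel (an arbitrary choice for this sketch).
gaussian = lambda u: np.exp(-np.dot(u, u))

weights = np.array([1.0, -0.5])
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
widths = np.array([1.0, 2.0])

# Value of the network at the origin.
q0 = rbf_net(np.array([0.0, 0.0]), weights, centroids, widths, gaussian)
```

Restricting all widths to a single common value σ gives an element of S_0(K) instead.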
2 Approximation Results
As mentioned above, in Park and Sandberg (1991) it is shown that S_0(K) is dense in L^p(R^r), p ∈ [1, ∞), if K is an integrable bounded function such that K is continuous almost everywhere and ∫_{R^r} K(x) dx ≠ 0. Our first theorem concerns the p = 1 case; a necessary and sufficient condition is given for approximation with S_0(K).
Theorem 1. Assuming that K : R^r → R is integrable, S_0(K) is dense in L^1(R^r) if and only if ∫_{R^r} K(x) dx ≠ 0.

Proof. Suppose first that ∫_{R^r} K(x) dx ≠ 0, and define J = |∫_{R^r} K(x) dx|. Let f ∈ L^1(R^r) and ε > 0 be given. Since C_c(R^r) is dense in L^1(R^r) (Rudin 1987), we can choose a nonzero f_c ∈ C_c(R^r) such that
||f − f_c||_1 < ε/4    (1)

Since f_c has compact support, there exists a positive T such that supp f_c ⊂ [−T, T]^r.
Choose a function K_c ∈ C_c(R^r) such that ∫_{R^r} K_c(x) dx ≠ 0 and ||K − K_c||_1 is small enough for the final estimate below, and define φ = K_c / ∫_{R^r} K_c(y) dy and φ_σ(x) = σ^{−r} φ(x/σ) for σ > 0.
By Lemma 1 (in the appendix), we have

||f_c − φ_σ * f_c||_1 → 0  as  σ → 0

Choose σ > 0 such that

||f_c − φ_σ * f_c||_1 < ε/4    (3)
Note that φ_σ(a − ·) f_c(·) is Riemann integrable on [−T, T]^r, since φ_σ and f_c are each continuous and bounded. Define v_n : R^r → R by

v_n(a) = (2T/n)^r Σ_{i=1}^{n^r} φ_σ(a − a_i) f_c(a_i)

where the set {a_i ∈ R^r : i = 1, 2, ..., n^r} consists of all points in [−T, T]^r of the form

(−T + 2i_1T/n, −T + 2i_2T/n, ..., −T + 2i_rT/n),  i_1, i_2, ..., i_r = 1, 2, ..., n
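This Riemann-sum step can be checked numerically; the following one-dimensional sketch (not in the original) uses an illustrative Gaussian φ_σ and a compactly supported f_c, and shows the sums settling as n grows:

```python
import numpy as np

# Illustrative choices (r = 1): a normalized Gaussian mollifier phi_sigma
# and a continuous f_c supported on [-T, T].
T, sigma = 1.0, 0.3
phi_sigma = lambda u: np.exp(-(u / sigma) ** 2 / 2) / (sigma * np.sqrt(2 * np.pi))
f_c = lambda x: np.where(np.abs(x) <= T, np.cos(np.pi * x / 2) ** 2, 0.0)

def v_n(a, n):
    """Riemann sum for the convolution (phi_sigma * f_c)(a) over [-T, T],
    using the grid points -T + 2iT/n, i = 1, ..., n."""
    grid = -T + 2 * np.arange(1, n + 1) * T / n
    return (2 * T / n) * np.sum(phi_sigma(a - grid) * f_c(grid))

coarse = v_n(0.2, 10)     # crude partition
fine = v_n(0.2, 2000)     # refined partitions: values stabilize
finer = v_n(0.2, 4000)
```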
Note that v_n(a) is a Riemann sum for ∫_{[−T,T]^r} φ_σ(a − x) f_c(x) dx, and that the v_n are bounded uniformly in n, with supports contained in a fixed compact set (since φ_σ has compact support).
Thus, for any a ∈ R^r,

v_n(a) → (φ_σ * f_c)(a)  as  n → ∞

Since (φ_σ * f_c) and the v_n are dominated by an integrable bounded function with compact support, by the dominated convergence theorem

∫_{R^r} |(φ_σ * f_c)(a) − v_n(a)| da → 0  as  n → ∞
Thus, there is an N ∈ N for which

∫_{R^r} |(φ_σ * f_c)(a) − v_N(a)| da < ε/4

Since ṽ_N : R^r → R, defined by replacing K_c with K in the expression for v_N, that is,

ṽ_N(a) = (2T/N)^r Σ_{i=1}^{N^r} σ^{−r} [K((a − a_i)/σ) / ∫_{R^r} K_c(y) dy] f_c(a_i),

is an element of S_0(K) satisfying ||f − ṽ_N||_1 < ε,
S_0(K) is dense in L^1(R^r). To show the "only if" part, we prove the contrapositive: Assume that ∫_{R^r} K(x) dx = 0. Then for any f ∈ L^1(R^r) such that ∫_{R^r} f(x) dx = J > 0, there is no g ∈ S_0(K) satisfying ||f − g||_1 < J/2, because
||f − g||_1 ≥ |∫_{R^r} [f(x) − g(x)] dx| = ∫_{R^r} f(x) dx = J
for g ∈ S_0(K). Thus S_0(K) is not dense in L^1(R^r), which completes the proof. □

Since the family S_0(K) is a proper subset of S_1(K), the "if" part of this theorem holds also with S_0(K) replaced by S_1(K). A family similar to S_1(K) is considered in Cybenko (1989) with regard to the approximation of functions in L^1(R^r); it is noted there that if K ∈ L^1(R^r) and ∫_{R^r} K(x) dx ≠
0, then the family S_2(K) consisting of functions q : R^r → R of the following form is dense in L^1(R^r):

q(x) = Σ_{i=1}^{M} w_i K(t_i x + y_i)
where M ∈ N, t_i ∈ R, and y_i ∈ R^r for i = 1, ..., M. The proof of this follows immediately from a generalization (Rudin 1973, Theorem 9.4) of a theorem due to Wiener. For the reader's convenience we state the theorem in the appendix. Here we make some pertinent observations:

1. S_1(K) defined above is a proper subset of S_2(K), and even S_1(K) is dense in L^1(R^r) under the condition ∫_{R^r} K(x) dx ≠ 0. This can be easily shown: Assume, to get a contradiction, that there is an s_0 ∈ R^r such that f̂(s_0) = 0 for all f ∈ S_1(K). Then K̂(σs_0) = 0 for all σ > 0, since the Fourier transform of K((· − z)/σ) at s is σ^r exp(−2πi(z, s)) K̂(σs).
Since K̂ is continuous, the above implies that

K̂(0) = ∫_{R^r} K(x) dx = 0
which contradicts the nonzero-integral condition. The main difference between S_1(K) and S_2(K) is the set from which the smoothing factors are drawn. In this connection, it is easy to see that our conclusion here, and also in Theorem 1, can be strengthened in that they hold if our σ_i > 0 and σ > 0 conditions are replaced by the conditions that σ_i ∈ S and σ ∈ S, where S is any subset of (0, ∞) such that zero is a cluster point of S. Also, note that the denseness of S_1(K) in L^1(R^r) is a corollary of Theorem 1.

2. When K : R^r → R is integrable, S_1(K) is dense in L^1(R^r) only if ∫_{R^r} K(x) dx ≠ 0. The "only if" part of the proof of Theorem 1 shows this.

The above observations give the following theorem:
Theorem 2. Assuming that K : R^r → R is integrable, S_1(K) is dense in L^1(R^r) if and only if ∫_{R^r} K(x) dx ≠ 0.
Up to this point our results concern the approximation of functions in L^1(R^r) under the condition that ∫_{R^r} K(x) dx ≠ 0. As shown above, this condition is necessary for approximation with S_0(K) or S_1(K). A natural question that arises is whether the nonzero-integral condition is necessary for approximation in L^p(R^r), p ∈ (1, ∞). We will see below that it is not necessary for p = 2.
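The role of the nonzero-integral condition can also be seen numerically. In the sketch below (not part of the paper), K is an odd, hence zero-integral, kernel, so every element of S_0(K) also has zero integral, and its L^1 distance from a positive-integral target f is bounded below by ∫f; all specific functions and parameters are arbitrary choices:

```python
import numpy as np

x = np.linspace(-20.0, 20.0, 400001)
dx = x[1] - x[0]

K = lambda u: u * np.exp(-u * u)   # odd kernel, so its integral is zero
f = np.exp(-x * x)                 # target with integral sqrt(pi) > 0
J = np.sum(f) * dx                 # numerical value of int f

# An arbitrary element g of S_0(K): common width sigma, several nodes.
sigma = 0.7
weights = [2.0, -1.0, 0.5]
centers = [-1.0, 0.0, 2.0]
g = sum(w * K((x - z) / sigma) for w, z in zip(weights, centers))

# Each node integrates to sigma * int(K) = 0, so int g is (numerically) zero,
# and ||f - g||_1 >= |int(f - g)| = J no matter how the weights are chosen.
g_int = np.sum(g) * dx
l1_err = np.sum(np.abs(f - g)) * dx
```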
In the following theorem, attention is focused on kernel functions K : R^r → R with the property that for every M ⊂ R^r with positive measure there is a σ > 0 such that K̂(σ·) ≠ 0 almost everywhere on some positive-measure subset of M. We call such K pointable. We shall use the fact that the negation of this condition on K is that for some M of positive measure, K̂(σ·) = 0 almost everywhere on M for all σ > 0.
Theorem 3. Assuming that K : R^r → R is a square-integrable function, S_1(K) is dense in L^2(R^r) if and only if K is pointable.
Proof. We make use of the following characterization of closed translation-invariant subspaces of L^2(R^r), which is an easy modification of Rudin (1987, Theorem 9.17).

Lemma 2. Associate to each measurable set E ⊂ R^r the linear space M_E of all f ∈ L^2(R^r) such that f̂ = 0 almost everywhere on E. Then each M_E is a closed translation-invariant subspace of L^2(R^r), and every closed translation-invariant subspace of L^2(R^r) is M_E for some E.

Consider any K satisfying the indicated conditions, and suppose that the closure of S_1(K) is not L^2(R^r). Then, since this closure is translation-invariant, by Lemma 2 there is a measurable subset E of R^r having positive measure such that

f̂ = 0  almost everywhere on E

for any f in the closure of S_1(K). In particular,

σ^r exp(−2πi(z, ·)) K̂(σ·) = 0  almost everywhere on E
for any z ∈ R^r and σ > 0. Thus, K̂(σ·) = 0 almost everywhere on E for all σ > 0, which contradicts the assumption that K is pointable. To show the "only if" part, we prove the contrapositive: Assume that there is a measurable set M ⊂ R^r with positive measure such that
K̂(σ·) = 0  almost everywhere on M
for all σ > 0. Then for any f ∈ L^2(R^r) with ||f̂ 1_M||_2 = J > 0,¹ there is no g ∈ S_1(K) satisfying ||f − g||_2 < J/2, because ĝ vanishes almost everywhere on M for every g ∈ S_1(K), so that

||f − g||_2 = ||f̂ − ĝ||_2 ≥ ||(f̂ − ĝ) 1_M||_2 = ||f̂ 1_M||_2 = J

This completes the proof. □
¹Here we use || · ||_2 to denote also the usual norm on the space of complex-valued square-integrable functions.
A large class of kernel functions satisfies the conditions of pointability. For example, kernel functions K such that K̂ ≠ 0 almost everywhere on some ball centered at the origin are pointable. Note that this class includes functions K with ∫_{R^r} K(x) dx = 0. A result for the general L^p(R^r) case along the lines of the "if" part of Theorem 2 is:
Proposition 1. With p ∈ (1, ∞), let K : R^r → R be an integrable function such that

∫_{R^r} |K(x)|^p dx < ∞

and

∫_{R^r} K(x) dx ≠ 0
Then S_1(K) is dense in L^p(R^r).

Proof. Suppose that S_1(K) is not dense in L^p(R^r). Then by the Hahn-Banach theorem (Rudin 1987), there exists a bounded linear functional Λ on L^p(R^r) such that

Λ[the closure of S_1(K)] = {0}    (6)

but

Λ[L^p(R^r)] ≠ {0}

By the Riesz representation theorem (Rudin 1987), Λ : L^p(R^r) → R can be represented by

Λ(f) = ∫_{R^r} f(x) g_Λ(x) dx

for some function g_Λ in L^q(R^r),² where q is the conjugate exponent of p defined by 1/p + 1/q = 1. In particular, from equation 6,

∫_{R^r} K((x − z)/σ) g_Λ(x) dx = 0

for any z ∈ R^r and σ > 0. Define K̃ : R^r → R and K̃_σ : R^r → R for σ > 0 by

K̃(x) = K(x) / ∫_{R^r} K(y) dy

and

K̃_σ(x) = (1/σ^r) K̃(x/σ)
²The strategy of using the Hahn-Banach theorem together with representations of linear functionals was first used in the neural-networks literature in Cybenko (1989).
Note that for any σ > 0 and z in R^r,

(1/σ^r) ∫_{R^r} K̃((x − z)/σ) g_Λ(x) dx = 0    (7)

Since K̃ ∈ L^1(R^r) and ∫_{R^r} K̃(x) dx = 1, by Lemma 1 (in the appendix),

||K̃_σ * g_Λ − g_Λ||_q → 0  as  σ → 0    (8)
By 7 and 8, we conclude that g_Λ is zero almost everywhere. This implies that Λ is the zero functional, which contradicts our supposition. □

Our focus has been on L^p approximation. We next give a theorem concerning the uniform approximation of continuous functions on compact subsets of R^r.

Theorem 4. Let K : R^r → R be an integrable function such that K is continuous and such that K̂^{−1}(0) includes no proper cone.³ Then S_1(K) is dense in C(W) with respect to the norm || · ||_∞ for any compact subset W of R^r.
Proof. Consider any compact subset W of R^r. Suppose that S_1(K) is not dense in C(W). Then, proceeding as in the proof of Proposition 1, we see that there is a nonzero finite signed measure μ that is concentrated on W and that satisfies

∫_{R^r} K((x − z)/σ) dμ(x) = 0    (9)

for any z ∈ R^r and σ > 0. With z ∈ R^r, σ > 0, and any function h ∈ L^1(R^r) ∩ L^∞(R^r) whose Fourier transform has no zeros⁴ (e.g., the gaussian function exp(−a|| · ||₂²) with a > 0), consider the integral

∫_{R^r} [∫_{R^r} K((x + y − z)/σ) h(x) dx] dμ(y)

Note that the integrand is absolutely integrable, because

∫_{R^r} ∫_{R^r} |K((x + y − z)/σ) h(x)| dx d|μ|(y) ≤ σ^r ||K||_1 ||h||_∞ |μ|(R^r)

where |μ| is the total variation of μ. By equation 9 and Fubini's theorem (see, e.g., Rudin 1962), we have

∫_{R^r} [∫_{R^r} K((x + y − z)/σ) h(x) dx] dμ(y) = ∫_{R^r} [∫_{R^r} K((x + y − z)/σ) dμ(y)] h(x) dx = 0    (10)

³Since K̂(−ω) equals the conjugate of K̂(ω) for any ω in R^r, this condition can be stated in terms of subspaces instead of cones.
⁴Here we use a strategy along the lines of Hornik (1991, proof of Theorem 5).
By the change of variable x + y → x and Theorem 3:4.5 of Petersen (1983), equation 10 is equivalent to

∫_{R^r} K((x − z)/σ)(h * μ)(x) dx = 0    (11)

where (h * μ)(x) = ∫_{R^r} h(x − y) dμ(y).
Note that h * μ is integrable (by Theorem 1:4.5 of Petersen 1983). It is also essentially bounded, because

|(h * μ)(x)| ≤ ∫_{R^r} |h(x − y)| d|μ|(y) ≤ ||h||_∞ |μ|(R^r)

for almost all x ∈ R^r. Consider the closed translation-invariant subspace I of L^1(R^r) defined as the L^1-closure of S_1(K). By equation 11 and the essential boundedness of h * μ, it easily follows that

∫_{R^r} f(x)(h * μ)(x) dx = 0    (12)
for any f in I. Following the notation in Rudin (1962), define the zero set Z(I) of I to be the set of ω where the Fourier transforms of all functions in I vanish. We claim that a nonzero element of R^r cannot be a member of Z(I) when K̂^{−1}(0) includes no proper cone. Assume, to get a contradiction, that ω ≠ 0 and ω ∈ Z(I). Then, using the definition of Z(I),

σ^r exp(−2πi(z, ω)) K̂(σω) = 0

for any z ∈ R^r and σ > 0. This implies that

K̂(σω) = 0  for all σ > 0

Since K̂ is continuous, this means that K̂^{−1}(0) includes the cone {σω ∈ R^r : σ ≥ 0}, which contradicts the cone condition. Thus, Z(I) is either the empty set or {0}. In either case, by Theorems 7.1.2 and 7.2.4 of Rudin (1962), any integrable function from R^r to R with zero integral is a member of I. Thus, equation 12 gives

∫_{R^r} f(x)(h * μ)(x) dx = 0    (13)
for any f in L^1(R^r) with ∫_{R^r} f(x) dx = 0.
Note that property 13 can hold only for h * μ in the class of almost-everywhere-constant functions. But since h * μ ∈ L^1(R^r) and zero is the only constant function in L^1(R^r), we have

h * μ = 0  almost everywhere.    (14)
Since ĥ has no zeros, by Theorem 2:2.2 of Petersen (1983) and Theorem 1.3.6 of Rudin (1962), equation 14 implies μ = 0. This contradicts our supposition, and thus proves the theorem. □

A corollary of this theorem is that S_1(K) is dense in C(W) for any compact subset W of R^r when the kernel K : R^r → R is integrable, continuous, and satisfies ∫_{R^r} K(x) dx ≠ 0. Finally, when K : R^r → R is integrable and radially symmetric with respect to the Euclidean norm, K̂ is also radially symmetric with respect to the Euclidean norm (Bochner and Chandrasekharan 1949, p. 69). In this setting, every K not equivalent to the zero element of L^1(R^r) satisfies the cone condition of Theorem 4. This observation gives the following:

Theorem 5. Let K : R^r → R be a nonzero integrable function such that K is continuous and radially symmetric with respect to the Euclidean norm. Then S_1(K) is dense in C(W) with respect to the norm || · ||_∞ for any compact subset W of R^r.
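Theorem 5 applies, in particular, to the Gaussian kernel. The sketch below (not part of the paper) fits only the weights of a fixed-centroid Gaussian element of S_1(K) by least squares and measures the uniform error on the compact set W = [−1, 1]; the target function, grid, and all parameter values are arbitrary choices:

```python
import numpy as np

# Continuous target to be approximated uniformly on W = [-1, 1].
target = lambda x: np.sin(3 * x) + 0.5 * x**2

centers = np.linspace(-1.2, 1.2, 25)   # fixed centroids z_i
sigma = 0.15                           # common smoothing factor
x = np.linspace(-1.0, 1.0, 801)        # sample grid on W

# Design matrix Phi[j, i] = K((x_j - z_i) / sigma), Gaussian K.
Phi = np.exp(-(((x[:, None] - centers[None, :]) / sigma) ** 2))

# Least-squares fit of the output weights w_i.
w, *_ = np.linalg.lstsq(Phi, target(x), rcond=None)

# Uniform (sup-norm) error of the resulting network on the grid.
uniform_err = np.max(np.abs(Phi @ w - target(x)))
```

This only illustrates that small uniform error is achievable with a particular parameter choice; the theorem itself guarantees that the error can be made arbitrarily small.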
3 Concluding Remarks
The results in this paper significantly improve previous results. In particular, we have given sharp conditions on the kernel function under which radial-basis-function networks having one hidden layer are capable of universal approximation on R^r or on compact subsets of R^r. A related result concerning uniform approximation using the elements of S_0(K) with integrable K is given in Park and Sandberg (1991, p. 254).

The results in Section 2 concern the approximation of real-valued functions. Approximations of complex-valued functions are also of interest. In this connection, it is a straightforward exercise to verify that Theorems 1-5 and Proposition 1 remain true if "K : R^r → R" is replaced with the condition that K maps R^r into the set C of complex numbers, L^p(R^r) denotes instead the corresponding space of C-valued functions, the elements of C(W) are taken to be C-valued, and S_0(K) and S_1(K) refer instead to the corresponding sets in which the weights w_i are drawn from C.

An important problem we have not addressed is that of determining the network parameters so that a prescribed degree of approximation is achieved.
Appendix
Lemma 1. Let f ∈ L^p(R^r), p ∈ [1, ∞), and let φ : R^r → R be an integrable function such that

∫_{R^r} φ(x) dx = 1

Define φ_ε : R^r → R by φ_ε(x) = (1/ε^r) φ(x/ε) for ε > 0. Then ||φ_ε * f − f||_p → 0 as ε → 0.⁵
Theorem 9.4 of Rudin (1973). If Y is a closed translation-invariant subspace of L^1(R^r), and if

Z(Y) = ∩_{f∈Y} {s ∈ R^r : f̂(s) = 0}

is empty, then Y = L^1(R^r).

References

Bochner, S., and Chandrasekharan, K. 1949. Fourier Transforms. Princeton University Press, Princeton.
Cybenko, G. 1989. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2, 303-314.
Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 251-257.
Park, J., and Sandberg, I. W. 1991. Universal approximation using radial-basis-function networks. Neural Comp. 3, 246-257.
Petersen, B. E. 1983. Introduction to the Fourier Transform and Pseudo-Differential Operators. Pitman, Marshfield, MA.
Rudin, W. 1962. Fourier Analysis on Groups. Interscience Publishers, New York.
Rudin, W. 1973. Functional Analysis. McGraw-Hill, New York.
Rudin, W. 1987. Real and Complex Analysis, 3rd ed. McGraw-Hill, New York.
Stein, E. M., and Weiss, G. 1971. Introduction to Fourier Analysis on Euclidean Spaces. Princeton University Press, Princeton.

Received 13 April 1992; accepted 3 September 1992.
⁵This lemma is used in Park and Sandberg (1991), where it is observed to be a slight modification of a theorem in Bochner and Chandrasekharan (1949). We have since found earlier proofs of the lemma (e.g., Petersen 1983, p. 72).
Precision and Approximate Flatness in Artificial Neural NetworksPrecision and Approximate Flatness in Artificial Neural Networks. Neural Computation 7:5, 1021-1039. [Abstract] [PDF] [PDF Plus] 64. Bjørn Lillekjendlie, Dimitris Kugiumtzis, Nils Christophersen. 1994. Chaotic time series. Part II. System Identification and Prediction. Modeling, Identification and Control: A Norwegian Research Bulletin 15:4, 225-245. [CrossRef] 65. Lipo Wang, Kiuju FuArtificial Neural Networks . [CrossRef]
Communicated by Richard Lippmann
A Polynomial Time Algorithm for Generating Neural Networks for Pattern Classification: Its Stability Properties and Some Test Results

Somnath Mukhopadhyay, Asim Roy, Lark Sang Kim, Sandeep Govil
Department of Decision and Information Systems, Arizona State University, Tempe, AZ 85287 USA

Polynomial time training and network design are two major issues for the neural network community. A new algorithm has been developed that can learn in polynomial time and also design an appropriate network. The algorithm is for classification problems and uses linear programming models to design and train the network. This paper summarizes the new algorithm, proves its stability properties, and provides some computational results to demonstrate its potential.
1 Introduction

One of the critical issues in the field of neural networks is the development of polynomial time algorithms for neural network training. With the advent of polynomial time methods (Karmarkar 1984; Khachian 1979; and others), linear programming has drawn increased attention for its potential for training neural networks in polynomial time (Glover 1990; Mangasarian et al. 1990; Bennett et al. 1992; Roy and Mukhopadhyay 1991; Roy et al. 1992). This paper presents the method of Roy and Mukhopadhyay (1991) and Roy et al. (1992) in summary form and proves its stability properties under translation and rotation of data points. Application of the method to some well-known learning problems is also shown.

2 A Linear Programming Method for Neural Network Generation
The following notation is used henceforth. An input pattern is represented by the N-dimensional vector x, x = (X_1, X_2, ..., X_N). The pattern space, which is the set of all possible values that x may assume, is represented by Ω_x. K denotes the total number of classes. The method is for supervised learning, where the training set x^1, x^2, ..., x^n is a set of sample patterns with known classification.

Neural Computation 5, 317-330 (1993) © 1993 Massachusetts Institute of Technology
The basic idea of this method is similar to the hypersphere method of Reilly et al. (1982), where a class region is "covered" by a set of hyperspheres of varying size. This method, however, generates the "covers" in a completely different way, so as to obtain a polynomial time algorithm. Any complex nonconvex region can be covered by a set of elementary convex forms of varying size, such as hyperspheres and hyperellipsoids. Nonconvex covers can also be used when there is no problem in doing so. Let p elementary covers or masks (henceforth generally referred to as masks) be required to cover the region of a certain class P. To classify an input pattern as being in class P, it is necessary to determine if it falls within the area covered by one of the p masks. If the pattern space is two-dimensional and one of the p masks is a circle centered at (a, b) with a radius r, a simple masking function f(X_1, X_2) = r^2 - [(X_1 - a)^2 + (X_2 - b)^2] can determine if an input pattern is a member of this mask. If (X_1', X_2') is an input pattern, then

if f(X_1', X_2') ≥ 0, then (X_1', X_2') is inside this mask and belongs to class P;
if f(X_1', X_2') < 0, then (X_1', X_2') is not inside this mask.

In the learning phase, the procedure actually requires f(x) to be at least slightly positive [f(x) ≥ ε] for membership in the mask and at least slightly negative [f(x) ≤ -ε] for nonmembership. The membership criterion can be relaxed to f(x) ≥ -ε in the testing phase, if warranted by numerical accuracy considerations. So, in general, let p_k be the number of masks required to cover a class k, k = 1, ..., K. Let f_1^k(x), ..., f_{p_k}^k(x) denote these masking functions for class k. Then an input pattern x' will belong to class j if and only if one or more of its masks is at least slightly positive, and the masks for all other classes are at least slightly negative. Here, each mask will have its own threshold value ε as determined during its construction. Expressed in mathematical notation, an input pattern x' is in class j if and only if

f_i^j(x') ≥ ε_i^j    for at least one mask i, i = 1, ..., p_j

and

f_i^k(x') ≤ -ε_i^k   for all k ≠ j and i = 1, ..., p_k        (1)
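The decision rule in equation 1 maps directly onto a small routine. A minimal Python sketch (the dictionary-of-masks representation and the function name are ours, not from the paper; a pattern is left unclassified when no class fires, or when more than one does):

```python
def classify(x, masks, thresholds):
    """Apply the decision rule of equation 1.

    masks[k] is the list of masking functions f_i^k for class k;
    thresholds[k] holds the corresponding per-mask epsilon values.
    Returns the unique class whose masks fire, or None when the
    pattern cannot be classified.
    """
    firing = [k for k in masks
              if any(f(x) >= eps for f, eps in zip(masks[k], thresholds[k]))]
    return firing[0] if len(firing) == 1 else None

# two 2-D classes, one circular mask each (the r^2 - distance^2 form
# from the text): class P around the origin, class Q around (3, 0)
masks = {"P": [lambda x: 1.0 - (x[0] ** 2 + x[1] ** 2)],
         "Q": [lambda x: 1.0 - ((x[0] - 3.0) ** 2 + x[1] ** 2)]}
eps = {"P": [0.01], "Q": [0.01]}
classify((0.1, 0.2), masks, eps)   # only class P's mask fires
```
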
If all masks are at least slightly negative (i.e., below their individual -ε thresholds), the input pattern cannot be classified. If masks from two or more classes are at least slightly positive, the input pattern again cannot be classified, though an indication can be given about the possible contenders. Unlike the hypersphere method of Reilly et al. (1982), the standard masking function (if there is to be one) of this method can possibly be any
function that is linear in terms of the parameters to learn. For example, the following functions of the input vector x,

f(x) = Σ_{i=1}^{N} a_i X_i^2 + Σ_{i=1}^{N} b_i X_i + Σ_{i=1}^{N-1} Σ_{j=i+1}^{N} c_ij X_i X_j + d        (2)

and

f(x) = Σ_{i=1}^{N} b_i X_i + Σ_{i=1}^{N-1} Σ_{j=i+1}^{N} c_ij X_i X_j + d        (3)

are acceptable masking functions, since they are linear functions of the parameters to learn: the a_i's, b_i's, c_ij's, and d. In experimental studies of this procedure, a quadratic function, as shown in 2, has been used as the standard mask. A quadratic function is able to generate many shapes, such as hyperspheres and hyperellipsoids. It can also generate nonconvex shapes, which is acceptable as long as they help to cover a class region properly.
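The point that such masks are linear in their parameters can be made concrete: a quadratic mask is an ordinary linear form over an expanded feature vector, f(x) = w · φ(x). A minimal sketch (the helper name `quad_features` and the parameter layout are ours):

```python
import numpy as np

def quad_features(x):
    """Expanded feature vector for a quadratic mask:
    squares, linear terms, pairwise cross products, and a constant."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    cross = [x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate([x ** 2, x, cross, [1.0]])

# the mask value is a dot product, linear in the parameter vector
# w = (a_1..a_N, b_1..b_N, c_12.., d); here f(x) = 1 - X_1^2 - X_2^2
w = np.array([-1.0, -1.0, 0.0, 0.0, 0.0, 1.0])
print(w @ quad_features([0.5, 0.5]))   # positive: inside the unit circle
```
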
2.1 Generating Masks Via Linear Programming. The procedure is outlined on the basis of using a standard mask. But that need not be the case. A priori, it is not known how many masks will suffice for any of the class regions. Thus, the first attempt is to define a single mask that will cover the whole class region. If that fails, the sample patterns in that class are generally split into two or more clusters by using a clustering procedure that produces a predetermined number of clusters and then attempts are made to define separate masks for each of these clusters. If that should fail, or if only some are masked, then the unmasked clusters are further split for separate masking until masks are provided for each ultimate cluster. The general idea is to define as large a mask as possible to include as many of the sample patterns within a class in a given mask as is feasibly possible, thereby minimizing the total number of masks required to cover a given region and obtaining the best generalization. When it is not feasible to cover with a certain number of masks, the region is subdivided into smaller pieces and masking is attempted for each piece. That is, the unmasked sample patterns are successively subdivided into smaller clusters for masking. At any stage of this iterative procedure, there will be a number of clusters to be masked. It might be feasible to mask some of them, thereby necessitating the breakup only of the remaining unmasked clusters. This “divide and conquer” procedure is heuristic. One can explore many variations of it.
Figure 1: Masking functions generate a multilayer perceptron.

The feasibility of covering a set of sample patterns S_i of class i with a mask f(x) is determined by solving the following linear program (LP):

Minimize ε
s.t.  f(x^p) ≥ ε     for all pattern vectors x^p in the given set S_i to be masked (x^p ∈ S_i)
      f(x^p) ≤ -ε    for all pattern vectors x^p in classes other than class i
      ε ≥ a small positive constant                                    (4)
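The LP in equation 4 is small enough to sketch directly. The version below is our illustration (helper names are not from the paper): it uses scipy's `linprog` with the quadratic mask written as a linear form over expanded features, and minimizes ε subject to a floor δ, so a successful solve is essentially a feasibility certificate for the cluster:

```python
import numpy as np
from scipy.optimize import linprog

def quad_features(x):
    # squares, linear terms, pairwise cross products, constant term
    x = np.asarray(x, dtype=float)
    n = len(x)
    cross = [x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate([x ** 2, x, cross, [1.0]])

def fit_mask(inside, outside, delta=1e-3):
    """Solve the LP of equation 4 for one quadratic mask f(x) = w . phi(x):
    f >= eps on the cluster to be masked, f <= -eps on all other classes,
    eps >= delta > 0. Returns w, or None when the LP is infeasible
    (in which case the cluster must be split)."""
    Phi_in = np.array([quad_features(p) for p in inside])
    Phi_out = np.array([quad_features(p) for p in outside])
    nw = Phi_in.shape[1]
    c = np.zeros(nw + 1)
    c[-1] = 1.0                                            # minimize eps
    A_ub = np.vstack([
        np.hstack([-Phi_in, np.ones((len(inside), 1))]),   # -w.phi + eps <= 0
        np.hstack([Phi_out, np.ones((len(outside), 1))]),  #  w.phi + eps <= 0
    ])
    b_ub = np.zeros(len(A_ub))
    bounds = [(None, None)] * nw + [(delta, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:nw] if res.status == 0 else None
```

On a cluster that a quadratic can separate (e.g., a tight group of points surrounded by distant ones), `fit_mask` returns a coefficient vector; on an unseparable cluster it returns None, triggering the split step described below.
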
If the LP solution is feasible and optimal, masking of set S_i is complete and the LP solution to the parameters of f(x) defines the mask. If the LP is infeasible, masking of pattern set S_i with mask f(x) is not feasible, and the set must be broken up for feasible masking with mask f(x). The infeasibility of an LP can be determined in polynomial time.

2.2 Constructing a Multilayer Perceptron from the Masks. The masking procedure actually generates a multilayer perceptron. Figure 1 shows how a multilayer perceptron is constructed from the masking functions when quadratics are used as masks. Suppose class A has k masks and
class B has p. Each mask is evaluated in parallel at nodes A_1 through A_k and B_1 through B_p. For a given input pattern, the output of a node is 1 if the corresponding mask is at least slightly positive (≥ ε) and zero otherwise. A hard limiting nonlinearity (linear threshold unit) is used at these nodes. Class A hidden nodes A_1 through A_k are connected to the final output node A for the class, and likewise for class B. The output of node A is 1 if at least one of its inputs is 1 and zero otherwise, and likewise for node B. Again, hard limiting nonlinearities are used at these output nodes. An input pattern is in class A if the output of node A is 1 and that of node B is zero, and vice versa. The masking function coefficients correspond to the connection weights and are placed on the connections between the input nodes and hidden layer nodes, as shown in the figure. The higher order product and power terms are shown as direct inputs to the network. Actually, one more layer is needed at the input end to compute these higher order terms. In this constructive procedure, each hidden node is generated in an incremental fashion: there is no predesigned, fixed net within which learning takes place. Unlike classical connectionist learning, learning here is based on complete access to available information and on the ability to explicitly compare cases (by means of constraints). The net generated is shallow, is allowed to grow laterally only, and learning takes place in a single layer. As can be seen, the masking procedure constructs a restricted high-order polynomial network (Giles and Maxwell 1987) that is allowed to grow laterally in the hidden layer. This incremental growth of the net is similar in spirit to the GMDH method (Farlow 1984), which adds nodes when necessary.
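The construction just described amounts to one matrix of mask coefficients feeding hard limiting units, followed by per-class OR nodes. A numpy sketch of the forward pass (the layout and names are ours; for brevity the masks here are linear forms over an augmented input [x_1, x_2, 1] rather than full quadratics):

```python
import numpy as np

def forward(phi_x, W, eps, class_of_node):
    """One pass through the constructed net.
    W: rows are mask coefficient vectors (the learned connection weights);
    eps: per-hidden-node thresholds; class_of_node: class label per row."""
    hidden = (W @ phi_x >= eps).astype(int)     # hard limiting hidden units
    classes = sorted(set(class_of_node))
    return {c: int(any(hidden[i]                # OR output node per class
                       for i, cc in enumerate(class_of_node) if cc == c))
            for c in classes}

W = np.array([[-1.0, 0.0, 0.5],    # mask for class A: f = 0.5 - x1
              [ 1.0, 0.0, -0.5]])  # mask for class B: f = x1 - 0.5
eps = np.array([0.01, 0.01])
labels = ["A", "B"]
forward(np.array([0.0, 0.0, 1.0]), W, eps, labels)   # node A fires, B does not
```

A pattern for which both output nodes are zero (or both are one) is reported unclassifiable, exactly as in the decision rule of equation 1.
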
2.3 Outliers in Classification Problems. Many classification problems, by their very nature, generate patterns that fall outside their core class regions. All classification systems attempt to extract the core regions of each class. Because outliers are not identified a priori in the training set, their weeding out has to be performed either before or in conjunction with drawing of the boundaries. In this procedure, some weeding out is performed prior to and some during masking. Weeding out some outliers prior to masking is done in the following way. A clustering procedure that can produce exactly k clusters, such as K-means and hierarchical clustering, is used to divide the training set into k small groups (clusters) and all minority class members in each group are discarded as outliers. This procedure, in essence, assigns a small neighborhood, as represented by one of the k clusters, to the class of its majority members. The process of breaking up core class regions into smaller pieces, cleansing them of outliers, and then masking them is, in effect, an attempt to minimize classification error. The weeding out is actually performed in three steps with the average cluster size
gradually decreasing to settle the allocation of unresolved neighborhoods. Sensitivity analysis is also performed in the first step. For example, if the training set has 700 samples, a clustering procedure can be used to divide the set into 100 clusters of average size 7. Suppose one such cluster has 5 class A and 2 class B patterns. The 2 class B patterns are in a minority in the cluster and are put in a candidate list of outliers. Suppose another cluster has 4 class A and 4 class B patterns. Since neither class commands a majority, the cluster cannot be assigned to any class. Such clusters are collected and broken up into smaller clusters in the next step. In the first step, the number of breakup clusters k is varied to test the sensitivity of the outlier diagnosis. Thus the training set is broken up twice into a different number of clusters and the consistency of each outlier diagnosis verified. For example, the training set of 700 samples may first be broken up into 80 clusters and an outlier candidate list developed. The same set may then be broken up into 120 clusters and a second outlier candidate list developed. Only those patterns that appear on both lists are discarded as outliers. In the second step, the unassigned clusters of step 1 are split into smaller clusters and a relaxed majority rule (> 50% membership defines majority class) used for territory allocation. So, for example, after the first step, out of the original 700 patterns, there might be 50 patterns unassigned, 70 thrown out as outliers and the remaining 580 retained for masking. In step 2, the 50 unassigned patterns may be split into 10 clusters of a reduced average size of 5. Suppose one such cluster has 3 class A and 2 class B patterns. The 2 class B patterns define a minority class and are discarded as outliers. The 3 class A patterns are retained for masking. In step 3, clusters can be arbitrarily assigned to a class if there is a membership tie-nothing remains unassigned. 
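The majority rules of the worked example above can be stated compactly. A sketch of the per-cluster tagging used in step 1 of the weeding phase (function and argument names are ours; the clustering itself is assumed already done):

```python
from collections import Counter

def tag_cluster(labels, core_frac=2/3, outlier_frac=1/3):
    """Tag each pattern in one cluster as 'core', 'outlier', or 'unassigned'.
    A class with at least a two-thirds majority supplies core patterns;
    members of classes holding less than one-third are outliers;
    everything else stays unassigned for the next, finer-grained step."""
    counts = Counter(labels)
    total = len(labels)
    tags = {}
    for cls, n in counts.items():
        frac = n / total
        if frac >= core_frac:
            tags[cls] = "core"
        elif frac < outlier_frac:
            tags[cls] = "outlier"
        else:
            tags[cls] = "unassigned"
    return [tags[c] for c in labels]
```

On the 5-A / 2-B cluster of the worked example, the A patterns come out as core and the B patterns as outliers; a 4-A / 4-B cluster is left entirely unassigned. Steps 2 and 3 reuse the same function with the relaxed (simple-majority) thresholds.
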
The LP algorithm is summarized below. Further details are available in Roy and Mukhopadhyay (1991) and Roy et al. (1992), including a proof of its polynomial time convergence.

2.4 The Algorithm.

Phase I: Weed Out Outliers
Step 1

1. Break up the training set into two different cluster sets A and B, each consisting of a different number of clusters, using a clustering procedure that can produce a predetermined number of clusters, such as K-means and hierarchical clustering. (The number of breakup clusters is determined, in turn, by the average cluster size chosen. We generally vary it between 6 and 8 to check sensitivity.)

2. In each cluster set, classify a pattern as an "outlier" if its class has less than one-third of the members, as a "core pattern" if its class has at least a two-thirds majority, and as "unassigned" otherwise.
3. Compare these classifications across the cluster sets A and B. If a pattern is classified as a "core pattern" in one set and as an "outlier" in the other, reclassify it as a "core pattern." All other inconsistently classified patterns are reclassified as "unassigned."

4. "Unassigned" patterns are carried over to step 2, the "outlier" patterns discarded, and the "core patterns" retained for masking in phase II.

Step 2

1. Break up the remaining "unassigned" patterns into smaller sized clusters using a clustering procedure that can produce k clusters, and classify patterns as "outlier," "core pattern," or "unassigned," using a relaxed majority rule (over 50% only) for the "core pattern" classification. A pattern is an "outlier" when its class possesses less than 50% of the members and is "unassigned" otherwise.

2. "Unassigned" patterns are carried over to step 3, the "outliers" discarded, and the "core patterns" retained for masking in phase II.

Step 3

Repeat step 2 with the remaining "unassigned" patterns, splitting them into smaller sized clusters. In this step, the "simple majority" rule is used to classify "core patterns," classification ties are resolved arbitrarily, and no patterns remain "unassigned."

Phase II: Construct Masking Functions

0. For each class i, i = 1, ..., K, perform the following two steps:

1. For each unmasked cluster of class i, set up the LP in (4). Initially, a class has only one unmasked cluster, consisting of all the "core patterns" from phase I, unless that is split up to start with. Solve the LPs. If all LP solutions are feasible and optimal, masking of class i is complete; go to step 0 for the next class. Otherwise, when some or all LPs are infeasible, save all feasible LP solutions (masks), if any, and go to step 2.

2. Collect all unmasked clusters of class i and split them into smaller clusters using a clustering procedure. Discard all resulting clusters that are very small in size (e.g., fewer than 2% of the total sample patterns) as "outliers." Return to step 1.

If the problem has no noise, phase I can be skipped. Outliers remaining after phase I can cause some masks to break up. Phase II can be rerun as a cleanup pass to obtain bigger masks, producing better generalization. In this procedure, the basic purpose of clustering is to dissect the data and not to uncover "real" clusters.
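Phase II's cover-or-split loop can be sketched as follows, with the LP solve and the clustering step abstracted behind callables (this scaffolding is ours; the paper uses the LP of equation 4 and hierarchical clustering in those roles):

```python
def phase_two(core_patterns, try_mask, split, min_size=2):
    """Cover one class's core patterns with masks, splitting unmasked
    clusters until every sufficiently large cluster is masked.

    try_mask(cluster) -> mask or None   (LP feasible / infeasible)
    split(cluster)    -> list of smaller clusters
    """
    masks, pending = [], [list(core_patterns)]
    while pending:
        cluster = pending.pop()
        if len(cluster) < min_size:
            continue                        # tiny clusters discarded as outliers
        mask = try_mask(cluster)
        if mask is not None:
            masks.append(mask)              # LP feasible: cluster covered
        else:
            pending.extend(split(cluster))  # infeasible: break it up
    return masks

# toy stand-ins: a 1-D "mask" is feasible when the cluster's span is <= 1,
# and splitting simply halves the sorted cluster
def try_mask(c):
    return (min(c), max(c)) if max(c) - min(c) <= 1 else None

def split(c):
    c = sorted(c)
    m = len(c) // 2
    return [c[:m], c[m:]]

masks = phase_two([0.0, 0.2, 0.9, 5.0, 5.3], try_mask, split)
```

The toy run above ends with two interval "masks," one around each group of points, with the isolated point 0.9 dropped as a too-small cluster.
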
2.5 Stability Properties. To show the translation and rotation invariance properties of this method, the following have to be demonstrated: (1) stability of the steps of the algorithm that use clustering, and (2) stability of the linear programming solutions. Consider the following related pair of problems:

Problem I.

Minimize ε
s.t.  f(x^p) ≥ ε,   p ∈ G_1
      f(x^p) ≤ -ε,  p ∈ G_2
      ε ≥ a small positive constant        (5)

Problem II.

Minimize ε
s.t.  f(Rx^p + t) ≥ ε,   p ∈ G_1
      f(Rx^p + t) ≤ -ε,  p ∈ G_2
      ε ≥ a small positive constant        (6)

where R is a rotation matrix, t a translation vector, G_1 the set of pattern vectors to be masked, G_2 the set of pattern vectors belonging to classes other than that of G_1, and f a mask that is linear in its parameters. It is assumed that R is nonsingular and that the transpose of a rotation matrix is also its inverse. The following stability results are proven for a quadratic mask, f(x) = x^T A x + b^T x + c, where A is an N x N matrix, b a vector of size N, and c a scalar. Similar stability results can be shown for other linear masks.
Stability Theorem 1. The optimum objective function values for problems I and II, when they are feasible, are the same, and the solutions for the quadratic masks are equivalent.

Proof. Let

A_2 = R A_1 R^{-1}                              (7)
b_2^T = b_1^T R^{-1} - 2 t^T A_2                (8)
c_2 = t^T A_2 t - b_1^T R^{-1} t + c_1          (9)
ε_2 = ε_1                                       (10)

It is shown that if the solution (A_1, b_1, c_1, ε_1) is optimal for problem I, then (A_2, b_2, c_2, ε_2), as defined in 7-10, is optimal for problem II, and if (A_2, b_2, c_2, ε_2) is optimal for II, then (A_1, b_1, c_1, ε_1) is optimal for I. Given the assumed optimality of the solutions (A_1, b_1, c_1, ε_1) and (A_2, b_2, c_2, ε_2) for the respective problems (I and II), they must also be feasible for these two problems. By substituting the solution 7-10 in problem II, one obtains ε_1 ≥ ε_2, and by similar substitution in problem I, one obtains ε_1 ≤ ε_2. Consequently, ε_1 = ε_2 and the stated conclusions follow at once. □
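Stability Theorem 1 is also easy to check numerically: transforming a random symmetric quadratic mask by equations 7-9 reproduces the original mask values on rotated and translated points, f_2(Rx + t) = f_1(x). This verification script is ours:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4

# random quadratic mask f1(x) = x^T A1 x + b1^T x + c1, with A1 symmetric
A1 = rng.standard_normal((N, N))
A1 = (A1 + A1.T) / 2
b1 = rng.standard_normal(N)
c1 = rng.standard_normal()

# random orthogonal R (so R^T = R^-1) and translation t
R, _ = np.linalg.qr(rng.standard_normal((N, N)))
t = rng.standard_normal(N)

Rinv = np.linalg.inv(R)
A2 = R @ A1 @ Rinv                       # equation 7
b2 = b1 @ Rinv - 2 * (t @ A2)            # equation 8 (stored as the row b2^T)
c2 = t @ A2 @ t - b1 @ Rinv @ t + c1     # equation 9

def f(x, A, b, c):
    return x @ A @ x + b @ x + c

x = rng.standard_normal(N)
# the transformed mask agrees with the original on the transformed point
print(np.isclose(f(x, A1, b1, c1), f(R @ x + t, A2, b2, c2)))
```
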
Corollary: Stability of Infeasible LPs. When problem I is infeasible, so is problem II, and vice versa.
Proof. Follows directly from Stability Theorem 1 by means of contradiction: if one problem were feasible, then so would be the other. □

Rotation and/or translation of data points do not affect the relative distances between them. If the distance matrix remains unchanged, any hierarchical clustering method used to produce k clusters will produce the same set of clusters (Everitt 1980; Hartigan 1975), independent of rotation and/or translation of data points. Hence, if the algorithm generates the same set of pattern vectors for the clustering step at every stage of phases I and II, and if they are split into the same number of clusters, the resulting clusters at each stage will be identical.

Stability Theorem 2. The masking and clustering outcomes, when a hierarchical clustering method is used to produce k clusters, are unaffected by any rotation and/or translation of the training set, when all other conditions of the algorithm remain unchanged.

Proof. Follows directly from Theorem 1, its corollary, and from the observations on clustering outcomes noted above. □

3 Computational Results
All results have been obtained from an implementation of this algorithm on the SAS system (SAS Institute Inc. 1988). The problems were solved on an IBM 3090 operating under the MVS operating system. For clustering, the average linkage method of hierarchical clustering (Sokal and Michener 1958) was used.

3.1 The Parity Problem. Rumelhart et al. (1986) tried the parity problem ranging in size from two to eight bits. They used a single hidden layer architecture, which requires at least N hidden units to solve parity with N inputs. Mühlenbein (1990) reports that the backpropagation algorithm never converged for N ≥ 6 with N hidden units, but converged when provided with 2N hidden units. Tesauro and Janssens (1988) also used 2N hidden units to overcome the local minimum problem. Table 1 shows the clustering and LP solution times for this procedure. Since it is a two-class problem, only one class was masked. Phase I was not used, since there is no noise in this problem. The table shows that the algorithm is extremely fast.

3.2 The Symmetry Problem. Rumelhart et al. (1986) discovered that the symmetry problem can always be solved with only two hidden units in a single layer. Table 2 shows the results of this procedure on the symmetry problem. Since it is a two-class problem, only one class was masked. It is solved in all cases with a single hidden unit (mask), and the LP solution times are close to zero. A masking function with linear and square terms only was used.
Table 1: Solution Times for the Parity Problem

N (no. of bits)   Clustering time (sec)   LP time (sec)   No. of masking functions
2                 0                       0               1
3                 0.05                    0               2
4                 0.05                    0               2
5                 0.16                    0               3
6                 0.65                    0               7
7                 1.95                    13              14
8                 5.85                    40              28

Table 2: Solution Times for the Symmetry Problem

N (no. of bits)   Clustering time (sec)   LP time (sec)   No. of masking functions
2                 -                       0               1
3                 0                       0               1
4                 0                       0               1
5                 0                       0               1
6                 0                       0               1
7                 0                       0               1
8                 0                       0               1
3.3 Overlapping Gaussian Distributions. An obvious question about this method is, how well would the outlier detection heuristic work on classes with dense overlap? To test the heuristic under those circumstances, the following two problems were set up.

Problem I: The I-I Problem. A simple two-class problem where both classes are described by gaussian distributions with different means, and with covariance matrices equal to the identity I. A four-dimensional problem with mean vectors [0 0 0 0] and [1 1 1 1] was tried. The Bayes error is about 15.2% in this case.

Problem II: The I-4I Problem. A two-class problem where both classes are described by gaussian distributions with zero mean vectors and covariance matrices equal to I and 4I. The optimal classifier here is quadratic, and the Bayes error for a four-dimensional problem is 17.64% and for an eight-dimensional problem is 9%.

Both problems were tried with randomly generated training sets of different sizes. Tables 3 and 4 show the results. Since both are two-class
Table 3: LP Solution Times and Results for Problem I (Bayes error 15.2%). [Column layout not recoverable from the scan. For training set sizes n = 60, 120, and 180 and phase-I cluster sizes M = (m1, m2, m3) = (6,5,3), (7,5,3), (8,5,3), the table reports the number of outliers found in phases I and II, the clustering + LP solution time (sec), the number of masking functions (one in every case), the error (%), and the error (%) of the SAS linear discriminant function.]

Table 4: LP Solution Times and Results for Problem II. [Column layout not recoverable from the scan. The same quantities are reported for a four-dimensional problem with n = 180 training patterns (Bayes error 17.64%) and an eight-dimensional problem with n = 400 (Bayes error 9%), with phase-I cluster sizes M = (6,5), (7,5), (8,5); a single masking function sufficed in every case.]
problems, only one of the classes need be masked. As shown, a single hidden node (mask) suffices for all cases. For both problems, the error rate generally decreases as the training set size is increased and tends toward the theoretical minimum Bayes error rate. A randomly generated test set of 400 examples was used for each problem. The table entries M = 7, 5, 3, etc., show the average cluster size that was used in each step of the three-step phase I procedure. Tables 3 and 4 also show the error rates of the SAS linear discriminant function on these same problems. For problem II, the SAS linear discriminant function was provided with the squares of the input values as additional inputs.

4 Conclusions
This paper describes a new algorithm that uses linear programming to find the weights of a network. Its advantages are that it can both design and train a net in polynomial time. The network design issue has often been overlooked by the neural network community. But a true neural network algorithm should be able to both design and train a network in polynomial time. This classifier, however, does not provide estimates of Bayesian probabilities and is currently implemented using SAS.
Acknowledgment This research was supported by the National Science Foundation Grant IRI-9113370.
References

Bennett, K. P., and Mangasarian, O. L. 1992. Neural network training via linear programming. In Advances in Optimization and Parallel Computing, P. M. Pardalos, ed. North-Holland, Amsterdam.
Everitt, B. S. 1980. Cluster Analysis, 2nd ed. Heinemann Educational Books, London.
Farlow, S. 1984. Self-organizing Methods in Modeling. Marcel Dekker, New York.
Giles, C. L., and Maxwell, T. 1987. Learning, invariance, and generalization in high-order networks. Appl. Optics 26(23), 4972-4978.
Glover, F. 1990. Improved linear programming models for discriminant analysis. Decision Sci. 21(4), 771-785.
Hartigan, J. A. 1975. Clustering Algorithms. John Wiley, New York.
Karmarkar, N. 1984. A new polynomial time algorithm for linear programming. Combinatorica 4, 373-395.
Khachian, L. G. 1979. A polynomial algorithm in linear programming. Dokl. Akad. Nauk SSSR 244(5), 1093-1096; Soviet Math. Dokl. 20, 191-194.
Mangasarian, O. L., Setiono, R., and Wolberg, W. H. 1990. Pattern recognition via linear programming: Theory and application to medical diagnosis. In Large-Scale Numerical Optimization, T. F. Coleman and Y. Li, eds., pp. 22-30. SIAM, Philadelphia.
Mühlenbein, H. 1990. Limitations of multilayer perceptrons: Steps towards genetic neural networks. Parallel Comput. 14(3), 249-260.
Reilly, D. L., Cooper, L. N., and Elbaum, C. 1982. A neural model for category learning. Biol. Cybern. 45, 35-41.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations, D. E. Rumelhart and J. L. McClelland, eds., pp. 318-362. MIT Press, Cambridge, MA.
Roy, A., and Mukhopadhyay, S. 1991. Pattern classification using linear programming. ORSA J. Comput. 3(1), 66-80.
Roy, A., Kim, L. S., and Mukhopadhyay, S. 1992. A polynomial time algorithm for the construction and training of a class of multilayer perceptrons. Neural Networks, in press.
SAS Institute Inc. 1988. SAS Manual. Cary, NC.
Sokal, R. R., and Michener, C. D. 1958. A statistical method for evaluating systematic relationships. Univ. Kansas Sci. Bull. 38, 1409-1438.
Tesauro, G., and Janssens, R. 1988. Scaling relationships in backpropagation learning. Complex Syst. 2, 39-44.

Received 1 November 1991; accepted 1 September 1992.
Communicated by Fernando Pineda
Neural Networks for Optimization Problems with Inequality Constraints: The Knapsack Problem

Mattias Ohlsson, Carsten Peterson, Bo Söderberg
Department of Theoretical Physics, University of Lund, Sölvegatan 14A, S-22362 Lund, Sweden
A strategy for finding approximate solutions to discrete optimization problems with inequality constraints using mean field neural networks is presented. The constraints x ≤ 0 are encoded by xΘ(x) terms in the energy function. A careful treatment of the mean field approximation for the self-coupling parts of the energy is crucial, and results in an essentially parameter-free algorithm. This methodology is extensively tested on knapsack problems of sizes up to 10^3 items. The algorithm scales like NM for problems with N items and M constraints. Comparisons are made with an exact branch and bound algorithm when this is computationally possible (N ≤ 30). The quality of the neural network solutions consistently lies above 95% of the optimal ones at a significantly lower CPU expense. For the larger problem sizes the algorithm is compared with simulated annealing and a modified linear programming approach. For "nonhomogeneous" problems these produce good solutions, whereas for the more difficult "homogeneous" problems the neural approach is a winner with respect to solution quality and/or CPU time consumption. The approach is of course also applicable to other problems of similar structure, like set covering.

1 Background
Feedback artificial neural networks (ANN) have turned out to be powerful in finding good approximate solutions to difficult combinatorial optimization problems (Hopfield and Tank 1985; Peterson and Söderberg 1989; Peterson 1990; Gislén et al. 1989, 1991). The basic procedure is to map the problems onto neural networks of binary (Ising spin) or K-state (Potts spin) neurons with appropriate choice of energy functions, and then to find approximate minima of the energy using mean field theory (MFT) techniques. In this way essentially "black box" procedures emerge. The application areas dealt with in Hopfield and Tank (1985), Peterson and Söderberg (1989), and Gislén et al. (1989, 1991) (traveling salesman,

Neural Computation 5, 331-339 (1993) © 1993 Massachusetts Institute of Technology
graph partition and scheduling) are characterized by global equality constraints, which can be implemented as quadratic penalty terms. These contain self-interaction parts (diagonal terms), which can be balanced by counterterms to assure reliable MFT dynamics. However, in many real-world optimization problems, in particular those of resource allocation type, one has to deal with inequalities. The objective of this work is to develop a mapping and MFT method to deal with this kind of problem. As a typical resource allocation problem we choose the knapsack problem for our studies. Although artificial, we feel it is a realistic enough test bed. A crucial ingredient in our approach is to avoid self-couplings by a proper MFT implementation of the constraint terms.

2 The Knapsack Problem
In the knapsack problem one has a set of N items i with associated utilities c_i and loads a_ki. The goal is to fill a "knapsack" with a subset of the items such that their total utility,

U = Σ_{i=1}^{N} c_i s_i    (1)

is maximized, subject to a set of M load constraints,

Σ_{i=1}^{N} a_ki s_i ≤ b_k,    k = 1, ..., M    (2)
defined by load capacities b_k. In equations 1 and 2, s_i are binary (0,1) decision variables, representing whether or not item i goes into the knapsack. The variables (c_i, a_ki, and b_k) that define the problem are all real numbers. We will consider a class of problems where a_ki and c_i are independent uniform random numbers on the unit interval, while the b_k are fixed to a common value b. With b = N/2, the problem becomes trivial: the solution will have almost all s_i = 1. Conversely, with b << N/4, the number of allowed configurations will be small and an exact solution can easily be found. We pick the most difficult case, defined by b = N/4. The expected number of used items in an optimal solution will then be about N/2, and an exact solution becomes inaccessible for large N. In the optimal solution to such a problem, there will be a strong correlation between the value of c_i and the probability for s_i to be 1. With a simple heuristic based on this observation, one can often obtain near-optimal solutions very fast. We will therefore also consider a class of more difficult problems with more homogeneous c_i distributions: the extreme case is when the c_i are constant, and the utility proportional to the number of items used. We note in passing that the set covering problem is a special case of the general problem, with random a_ki ∈ {0,1}, and b_k = 1. This defines a
Optimization Problems with Inequality Constraints
333
comparatively simple problem class, according to the above discussion, and we will stick to the knapsack problem in what follows.

Figure 1: (a) The sigmoid g(x; T) of equation 4. (b) The penalty function xΘ(x) of equation 5.

3 Neural Network Formulation and Solution Strategy
3.1 Neural Mapping. We start by mapping the problem defined in equations 1 and 2 onto a generic neural network energy function E,

E = -Σ_{i=1}^{N} c_i s_i + α Σ_{k=1}^{M} Φ( Σ_{i=1}^{N} a_ki s_i - b_k )    (3)

where Φ is a penalty function to ensure that the constraint in equation 2 is fulfilled. The coefficient α governs the relative strength between the utility and constraint terms. For equality constraints an appropriate choice of Φ(x) would be Φ(x) = x². Having inequalities we need a Φ(x) that penalizes only configurations for which x ≥ 0. One possibility is to use a sigmoid, Φ(x) = g(x; T) (see Fig. 1a),

g(x; T) = (1/2)[1 + tanh(x/T)]    (4)

This option has the potential disadvantage that the penalty is the same (= 1) no matter how badly the constraints are violated. An alternative that gives a penalty in proportion to the degree of violation is

Φ(x) = xΘ(x)    (5)

This function (see Fig. 1b) has the additional advantage that no extra parameter like the temperature T in the sigmoid is needed. The slope of Φ is implicitly given by α in equation 3. The xΘ(x) alternative consistently gives better performance and is used throughout this paper.
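As a concrete illustration, the two candidate penalty functions can be written down directly (a minimal sketch in Python; the function names are ours):

```python
import math

def sigmoid_penalty(x, T):
    """g(x; T) = (1/2)[1 + tanh(x/T)], equation 4: saturates at 1,
    so large violations cost no more than small ones."""
    return 0.5 * (1.0 + math.tanh(x / T))

def ramp_penalty(x):
    """x * Theta(x), equation 5: zero for satisfied constraints (x <= 0)
    and proportional to the degree of violation otherwise; no temperature
    parameter is needed."""
    return x if x > 0.0 else 0.0
```

The ramp's effective slope is then set by the coefficient α multiplying the constraint term in equation 3.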
M. Ohlsson, C. Peterson, and B. Soderberg
334
3.2 Mean Field Dynamics. We want to minimize equation 3 with the mean field approximation (MFT), which has turned out to be very powerful for other optimization problems (Peterson and Söderberg 1989; Peterson 1990; Gislén et al. 1989, 1991). Due to the nonpolynomial form of the constraint terms (equations 4 and 5) special care is needed when implementing the MFT approximation. Recall that the MFT approximation consists of replacing the binary variables s_i with mean field variables at temperature T, v_i = ⟨s_i⟩_T, and solving the MFT equations,

v_i = (1/2)[1 + tanh(-(1/T) ∂E/∂v_i)]    (6)

by iteration. In problems with equality constraints implemented by quadratic penalty terms the diagonal pieces are compensated for by adding appropriate self-coupling terms. Such a procedure is not trivial in this case of strongly nonlinear constraint penalties. Instead, we avoid self-couplings altogether, by replacing ∂E/∂v_i with the difference in E computed at v_i = 1 and v_i = 0, respectively. One obtains

v_i = (1/2){1 + tanh(-[E(v_i = 1) - E(v_i = 0)]/T)}    (7)
Equations 6 and 7 are solved iteratively by annealing in T. To avoid small final constraint violations, we employ a progressive constraint term, α ∝ 1/T. This means that the slope of xΘ(x) increases during convergence. We will present a standardized scheme below when testing the algorithm numerically. The number of computational steps for solving equations 6 and 7 scales like NM n_it, where n_it is the number of iterations needed for convergence, which turns out to be fairly problem size independent [as was observed in other MFT approaches to optimization problems (Peterson and Söderberg 1989; Gislén et al. 1989, 1991)]. A factor N has been gained by "recycling" the sums appearing in the argument of Φ; these are saved and need not be completely recomputed for each update of v_i.

3.3 High-T Fixpoints and Critical Temperature. At a high temperature T, the system will approach a fixpoint with all v_i close to 1/2 (see Fig. 2). With random a_ki and c_i on [0,1], and fixed b_k = b, two distinct types of high-T behavior emerge.
- With b well above b_crit ≈ N/4, all constraints are safe at high T, and the system is stuck at a trivial fixpoint, v_i = g(c_i; T).

- With b instead well below b_crit, all constraints are violated at high T, and the trivial fixpoint is instead v_i = g(c_i - α Σ_k a_ki; T).
Figure 2: Evolution of {v_i} for an N = M = 40 knapsack problem with c_i = rand[0.45, 0.55].
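The complete procedure of Sections 3.1-3.2, annealed from the high-T region, can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the instance handling, parameter values, and stopping rule are our own simplified choices.

```python
import numpy as np

def mft_knapsack(c, a, b, T0=10.0, k_anneal=0.95, alpha0=0.1, n_sweeps=200, seed=0):
    """Mean field annealing for: maximize c.s subject to a @ s <= b, s_i in {0,1}.
    Uses the x*Theta(x) penalty, a progressive constraint coefficient alpha ~ 1/T,
    and the self-coupling-free update based on E(v_i = 1) - E(v_i = 0)."""
    rng = np.random.default_rng(seed)
    N = len(c)
    v = 0.5 + 0.01 * (rng.random(N) - 0.5)   # start near the high-T fixpoint
    T = T0
    for _ in range(n_sweeps):
        alpha = alpha0 / T                    # progressive constraint term
        loads = a @ v                         # "recycled" sums, updated in place
        for i in range(N):
            base = loads - a[:, i] * v[i]     # loads with neuron i removed
            pen1 = np.maximum(base + a[:, i] - b, 0.0).sum()   # penalty at v_i = 1
            pen0 = np.maximum(base - b, 0.0).sum()             # penalty at v_i = 0
            dE = -c[i] + alpha * (pen1 - pen0)                 # E(v_i=1) - E(v_i=0)
            v_new = 0.5 * (1.0 + np.tanh(-dE / T))
            loads += a[:, i] * (v_new - v[i])
            v[i] = v_new
        T *= k_anneal
    return (v > 0.5).astype(int)
```

Each sweep costs O(NM) thanks to the recycled load sums, matching the NM scaling quoted in Section 3.2.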
In both cases a statistical analysis shows that the v_i remain close to 1/2 for T down to about 10. Thus, in the case at hand of b = b_crit, a suitable starting point for the annealing will be T ≈ 10.

4 Other Approaches
To see how well our MFT algorithm works we need to compare it with other approaches. For reasonably small problem sizes it is feasible to use an exact algorithm, branch and bound, for comparison. For larger problem sizes, one is confined to other approximate methods: simulated annealing (Kirkpatrick et al. 1983), greedy heuristics, and linear programming based on the simplex method (Press et al. 1986). Branch and Bound (BB): The knapsack problem is NP-complete, and the effort to find the optimal solution by brute force scales like 2^N. Using a branch and bound tree search technique one can reduce the number of computational steps. This method consists in going down a search tree,
M. Ohlsson, C. Peterson, and 8. Soderberg
336
checking bounds on constraints or utility for subtrees, thereby avoiding unnecessary searching. In particular for nonhomogeneous problems, this method is accelerated by ordering the c_i according to magnitude:

c_1 > c_2 > ... > c_N    (9)
For problems where the constraints are "narrow" (b not too large) this method can require substantially less computation. However, it is still based on exploration, and it is only feasible for problem sizes up to M = N ≈ 30-40. Greedy Heuristics (GH): This is a simple and fast approximate method for a nonhomogeneous problem. Proceeding from larger to smaller c_i (cf. equation 9), collect every item that does not violate any constraint. This method scales like NM. Simulated Annealing (SA): Simulated annealing (Kirkpatrick et al. 1983) is easily implemented in terms of attempted single-spin flips, subject to the constraints. Suitable annealing rates and other parameters are given below. This method also scales like NM times the number of iterations needed for thermalization. Linear Programming with Greedy Heuristics (LP): Linear programming based on the simplex method (Press et al. 1986) is not designed to solve discrete problems like the knapsack one. It does apply, however, to a modified problem with s_i ∈ [0,1]. For the ordered (equation 9) nonhomogeneous knapsack problem this gives solutions with a set of leading 1s and a set of trailing 0s, with a window in between containing real numbers. Augmented by greedy heuristics for the elements in this window, fairly good solutions emerge. The simplex method scales like N²M.

5 Numerical Comparisons
Neural Network (NN): Convenient measures for monitoring the decision process are the saturation C = (4/N) Σ_i (v_i - 0.5)² and the evolution rate Δ = (1/N) Σ_i (Δv_i)², where Δv_i = v_i(t + Δt) - v_i(t). The saturation starts off around 0 at high temperature T, and increases to 1 in the T → 0 limit. We have chosen an annealing schedule where T_0 = 10, T_n = k T_{n-1}, where k = 0.985 if 0.1 < C < (N - 1)/N and 0.95 otherwise. At each temperature every neuron is updated once. We employ a progressive constraint coefficient, α = 0.1/T, to avoid small final constraint violations. The algorithm is terminated when C > 0.999 and Δ < 0.00001. Should the final solution violate any constraint (which is very rare), the annealing is redone with a higher α. In Figure 2 we show a typical evolution of {v_i} for an N = M = 40 problem. Simulated Annealing (SA): The performance of this method depends on the annealing schedule. To compare the performance of this method with that of the neural network approach we have chosen the parameters
Optimization Problems with Inequality Constraints
337
Table 1: Comparison of performance and CPU time consumption for the different algorithms on an N = M = 30 problem. The CPU consumption refers to seconds on a DEC3100 workstation.

             c_i = rand[0,1]      c_i = rand[0.45,0.55]   c_i = 0.5
Algorithm    Perf.   CPU time     Perf.   CPU time        Perf.   CPU time
BB           1       16           1       1500            1       1500
NN           0.98    0.80         0.95    0.70            0.97    0.75
SA           0.98    0.80         0.95    0.80            0.96    0.80
LP           0.98    0.10         0.93    0.25            0.93    0.30
GH           0.97    0.02         0.88    0.02            0.85    0.02
such that the time consumption of the two methods is the same. This is accomplished with T_0 = 15, T_final = 0.01, and annealing factor k = 0.995. First we compare the NN, SA, and LP approaches with the exact BB for an N = M = 30 problem. This is done both for nonhomogeneous and homogeneous problems. The results are shown in Table 1. As expected, LP and in particular GH benefit from nonhomogeneity both quality- and CPU-wise, while for homogeneous problems the NN algorithm is the winner. For larger problem sizes it is not feasible to use the exact BB algorithm. The best we can do is to compare the different approximate approaches, NN, SA, and LP. The conclusions from problem sizes ranging from 50 to 500 are the same as above. The real strength of the NN approach is best exploited for more homogeneous problems. In Figures 3 and 4 we show the performance and CPU consumption for N ∈ [50, 500] with M = N.

6 Summary
We have developed a neural mapping and MFT solution method for finding good solutions to combinatorial optimization problems containing inequalities. The approach has been successfully applied to difficult knapsack problems, where it scales like NM. For the difficult homogeneous problems the MFT approach is very competitive as compared to other approximate methods, both with respect to solution quality and time consumption. It also compares very well with exact solutions for problem sizes where these are accessible. In addition, the MFT approach of course has the advantage of being highly parallelizable. This feature was not explored in this work. In Vinod et al. (1990) an ANN approach different from ours was applied to the knapsack problem. The idea in Vinod et al. (1990) is to make orthogonal projections onto convex sets. Since the difficult parameter region was not explored there, a numerical comparison would not be meaningful.
M. Ohlsson, C. Peterson, and B. Werberg
338
Figure 3: Performance of the neural network (NN) and linear programming (LP) approaches, normalized to simulated annealing (SA), for problem sizes ranging from 50 to 500 with M = N. (a) c_i = rand[0.45, 0.55] and (b) c_i = 0.5.

Figure 4: CPU consumption of the neural network (NN) and linear programming (LP) approaches, normalized to simulated annealing (SA), for problem sizes ranging from 50 to 500 with M = N. (a) c_i = rand[0.45, 0.55] and (b) c_i = 0.5. The numbers refer to DEC3100 workstations.
Optimization Problems with Inequality Constraints
339
Note: We recently became aware of a similar neural approach to the knapsack problem (Hellström and Kanal 1992), where the authors also use a discretized form of the derivative. Their treatment is confined to integer problems with a single constraint, whereas ours treats the more general case. Another difference is that the problems probed in Hellström and Kanal (1992) are in a nondifficult region (b << N/4) and of fairly small sizes (N = 20).

References

Gislén, L., Peterson, C., and Söderberg, B. 1989. Teachers and classes with neural networks. Int. J. Neural Syst. 1, 167.
Gislén, L., Peterson, C., and Söderberg, B. 1992. Scheduling high schools with neural networks. Neural Comp. 4, 805.
Hellström, B. J., and Kanal, L. N. 1992. Knapsack packing networks. IEEE Trans. Neural Networks 3, 202.
Hopfield, J. J., and Tank, D. W. 1985. Neural computation of decisions in optimization problems. Biol. Cybern. 52, 141.
Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. 1983. Optimization by simulated annealing. Science 220, 671.
Peterson, C. 1990. Parallel distributed approaches to combinatorial optimization. Neural Comp. 2, 261.
Peterson, C., and Söderberg, B. 1989. A new method for mapping optimization problems onto neural networks. Int. J. Neural Syst. 1, 3.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1986. Numerical Recipes: The Art of Scientific Computing. Cambridge University Press, Cambridge.
Vinod, V. V., Ghose, S., and Chakrabarti, P. P. 1990. Resultant projection neural networks for optimization under inequality constraints. Kharagpur Department of Computer Science Preprint.
Received 9 April 1992; accepted 2 September 1992.
This article has been cited by: 2. Arnaud Fréville, SaÏd Hanafi. 2005. The Multidimensional 0-1 Knapsack Problem—Bounds and Computational Aspects. Annals of Operations Research 139:1, 195-227. [CrossRef] 3. Youshen Xia . 2004. An Extended Projection Neural Network for Constrained OptimizationAn Extended Projection Neural Network for Constrained Optimization. Neural Computation 16:4, 863-883. [Abstract] [PDF] [PDF Plus] 4. Manfred Opper, Ole Winther. 2001. Adaptive and self-averaging Thouless-Anderson-Palmer mean-field theory for probabilistic modeling. Physical Review E 64:5. . [CrossRef] 5. Henrik Jönsson , Bo Söderberg . 2001. An Information-Based Neural Approach to Constraint SatisfactionAn Information-Based Neural Approach to Constraint Satisfaction. Neural Computation 13:8, 1827-1838. [Abstract] [PDF] [PDF Plus] 6. P. Persson, S. Nordebo, I. Claesson. 2001. Hardware efficient digital filter design by multimode mean field annealing. IEEE Signal Processing Letters 8:7, 193-195. [CrossRef] 7. P. Persson, S. Nordebo, I. Claesson. 2001. Multimode mean field annealing technique to design recursive digital filters. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 48:12, 1151-1154. [CrossRef] 8. M. Pelillo, K. Siddiqi, S.W. Zucker. 1999. Matching hierarchical structures using association graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 21:11, 1105-1120. [CrossRef] 9. A. Le Gall, V. Zissimopoulos. 1999. Extended Hopfield models for combinatorial optimization. IEEE Transactions on Neural Networks 10:1, 72-80. [CrossRef] 10. Jari Häkkinen , Martin Lagerholm , Carsten Peterson , Bo Söderberg . 1998. A Potts Neuron Approach to Communication RoutingA Potts Neuron Approach to Communication Routing. Neural Computation 10:6, 1587-1599. [Abstract] [PDF] [PDF Plus] 11. Xiaopeng Chen, K.M. Chugg, M.A. Neifeld. 1998. Near-optimal parallel distributed data detection for page-oriented optical memories. 
IEEE Journal of Selected Topics in Quantum Electronics 4:5, 866-879. [CrossRef] 12. B López, W Kinzel. 1997. Journal of Physics A: Mathematical and General 30:22, 7753-7764. [CrossRef] 13. Martin Lagerholm , Carsten Peterson , Bo Söderberg . 1997. Airline Crew Scheduling with Potts NeuronsAirline Crew Scheduling with Potts Neurons. Neural Computation 9:7, 1589-1599. [Abstract] [PDF] [PDF Plus] 14. Zhou Qingshan, Zou Yong, Hu Jiandong. 1997. A neural network approach to gate matrix layout. Journal of Electronics (China) 14:3, 209-214. [CrossRef] 15. Ibrahim H. Osman, Gilbert Laporte. 1996. Metaheuristics: A bibliography. Annals of Operations Research 63:5, 511-623. [CrossRef]
16. M. Aourid, B. Kaminska. 1996. Minimization of the 0-1 linear programming problem under linear constraints by using neural networks: synthesis and analysis. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 43:5, 421. [CrossRef] 17. Carsten Peterson, Ola Sommelius, Bo Söderberg. 1996. Variational approach for minimizing Lennard-Jones energies. Physical Review E 53:2, 1725-1731. [CrossRef] 18. K. Smith. 1996. An argument for abandoning the travelling salesman problem as a neural-network benchmark. IEEE Transactions on Neural Networks 7:6, 1542-1544. [CrossRef] 19. J F Fontanari. 1995. Journal of Physics A: Mathematical and General 28:17, 4751-4759. [CrossRef] 20. Roberto Battiti, Giampietro Tecchiolli. 1995. Local search with memory: benchmarking RTS. OR Spektrum 17:2-3, 67-86. [CrossRef] 21. E Korutcheva, M Opper, B Lopez. 1994. Journal of Physics A: Mathematical and General 27:18, L645-L650. [CrossRef] 22. Laurene V. FausettBoltzmann Machines . [CrossRef]
ARTICLE
Communicated by Richard Durbin
A Model for Motor Endplate Morphogenesis: Diffusible Morphogens, Transmembrane Signaling, and Compartmentalized Gene Expression Michel Kerszberg* Neurobiologie Cellulaire, Institut Pasteur, 25, rue du Docteur Roux, F-75724 Paris Cedex 15, France

Jean-Pierre Changeux Neurobiologie Moléculaire, CNRS UA D1284, Institut Pasteur, 25, rue du Docteur Roux, F-75724 Paris Cedex 15, France
A mathematical model for the formation and maintenance of synaptic contacts at the motor endplate is proposed. It is based on diffusion between sarcoplasmic nuclei of limiting amounts of a morphogen substance. The morphogen is postulated to act on genetic switch-like intranuclear units and to regulate positively both the transcription of its own gene and that of acetylcholine receptor (AChR) subunit genes. The efficacy of autoregulation is assumed to be depressed by electrical activity, while AChR gene transcription is enhanced by anterograde neural factors. Thus the model involves Turing's classical ingredients: autocatalysis and short-range activation by the morphogen, and long-range inhibition by electrical activity. Our predictions include: the stabilization of a single, transcriptionally active nucleus located in the central region of the developing muscle fiber (or myotube); the frequent occurrence of transcriptional activity in nuclei at the tendinous ends; and the onset, upon denervation of adult muscle, of transcription waves, starting from both the central site and the tendinous nuclei. In noninnervated fibers, the calculations show that spontaneous, irregular electrical activity leads to a variety of near-periodic spatial patterns of transcription; these are also predicted in innervated fibers when the depressing effect of electrical activity is weak, giving rise to the stabilization of multiple endplates as occurs in muscles with distributed innervation.
*Present address: Neurobiologie Moléculaire, CNRS UA D1284, Institut Pasteur, 25, rue du Docteur Roux, F-75724 Paris Cedex 15, France.
Neural Computation 5,341-358 (1993) @ 1993 Massachusetts Institute of Technology
1 Introduction
The molecular mechanisms by which synaptic contacts become established and stabilized in the nervous system remain a major issue of developmental biology. Connections must be adapted to their function and must remain so in spite of perturbations and molecular turnover: the latter suggesting that genetic control mechanisms are most certainly involved. The recent application to this question of recombinant DNA technology has opened the possibility to directly analyze the relevant processes in terms of the regulation of gene expression. In the present paper, we should like to propose a model for the operation of such mechanisms in the particular case of one of the best studied synaptic systems, the motor endplate that forms at the neuromuscular junction. At the adult motor endplate, the receptor for the neurotransmitter, acetylcholine (ACh), is densely accumulated under the motor nerve ending, while a few micrometers away from the synapse, its surface density falls rapidly. Such a privileged distribution of AChR molecules develops from an initial state of the embryonic muscle fiber where AChR molecules are dispersed over the entire surface of the muscle membrane (Salpeter and Loring 1985; Laufer and Changeux 1989; Changeux 1991). With the onset of motor innervation, AChR starts to cluster beneath the exploratory nerve ending while the density of extrajunctional AChR progressively decreases. In the majority of higher vertebrates’ skeletal muscles, a single or “focal” aggregate forms, while in a few others [such as anterior latissimus dorsi (ALD) in the chick], multiple, regularly spaced postsynaptic AChR clusters persist in the adult (Toutant et al. 1980). Among the factors that may contribute to this subcellular morphogenesis, prominent is the presence of multiple nuclei in the developing and adult muscle fiber, and the possible relationship between a differential distribution of AChR molecules and the topology of muscle nuclei actively transcribing AChR genes. 
In the developing myotube, all nuclei from the muscle syncytium do transcribe AChR genes, and this expression becomes progressively restricted to the junctional or "fundamental" nuclei in the absence of intercellular cleavage. A compartmentalized expression of AChR genes thus takes place (Changeux 1991; Changeux et al. 1990) that has been postulated to operate under the control of neural factors [such as calcitonin gene-related peptide (CGRP) (Fontaine et al. 1986) or the ARIA ("AChR-inducing activity") 42-kDa polypeptide (Harris et al. 1991)] as well as electrical activity, via distinct intracellular pathways of second messengers and transcription factors (Fontaine et al. 1987). The aim of the present model is to account for endplate morphogenesis by linking into a coherent, minimal theoretical construct: gene transcription regulation, nuclear topology, and diffusible substances involved in transmembrane as well as internuclear signaling (i.e., morphogens).
2 Presentation of the Model
The model is designed to encompass three main components: (1) the sarcoplasmic nuclei that potentially express AChR genes, (2) transmembrane signals, both localized at the interface between functional innervation and muscle sarcoplasm, as well as propagating along the fiber length (action potentials), and (3) the cytoplasm, as a medium through which internuclear signals diffuse. Concerning these subsystems, the salient experimental findings, which we shall use as our biological premises, are as follows. At the nucleus level, in situ hybridization with mixed exonic-intronic and strictly intronic probes reveals a discrete distribution of AChR α-, β-, δ-, or ε-subunit mRNA at junctional nuclei (Fontaine et al. 1988; Goldman and Staple 1989; Fontaine and Changeux 1989); also, in noninnervated cultures (Fontaine and Changeux 1989; Harris et al. 1989; Bursztajn et al. 1989; Horovitz et al. 1989; Berman et al. 1990), or in denervated muscle fibers, nuclei may coexist with contrasted levels of AChR α-subunit messenger, suggesting that all-or-none switch mechanisms may control transcription (Fontaine and Changeux 1989). Similar switches have been postulated (Monod and Jacob 1962) or reported in other developing systems as well (Kaufmann 1986; Thomas and D'Ari 1990). A simple biological mechanism for such a switch consists of a single gene a (and its product A) whose activity is controlled through a positive feedback loop by the transcription factor it codes for. Here, A may for instance turn on, in addition, a cascade (Britten and Davidson 1969) leading to transcription of the receptor subunit genes at the nucleus. Another, stronger mechanism would involve two genes a and i coding for transcription factors A and I, each factor enhancing its own synthesis while repressing that of the other. This simple genetic regulation network may evolve into a "flip-flop" with two highly stable states, one with high A, the other with high I.
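The two-gene flip-flop can be made concrete with a toy dynamical system (our illustration only; the Hill-type kinetics and all parameter values are assumptions, not part of the model):

```python
def flip_flop(A, I, dt=0.1, steps=2000, k=2.0, K=0.5, d=1.0):
    """Toy two-factor switch: each factor activates its own synthesis and
    represses the other's (Hill exponent 2); d is a linear decay rate.
    With these illustrative parameters the system is bistable: it settles
    into high-A/low-I or low-A/high-I depending on the initial state."""
    for _ in range(steps):
        dA = k * A**2 / (K**2 + A**2) * K**2 / (K**2 + I**2) - d * A
        dI = k * I**2 / (K**2 + I**2) * K**2 / (K**2 + A**2) - d * I
        A, I = A + dt * dA, I + dt * dI
    return A, I
```

Starting from a slight excess of one factor, the iteration converges to the corresponding stable state, which is the "highly stable states" behavior invoked in the text.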
Product A might stand for one or several myogenic factors such as proteins belonging to the MyoD family (including myogenin), which induce myoblast differentiation (Davis et al. 1987) and are positively autoregulated (Thayer et al. 1989); while I could be identified with some of their putative antagonists, for example, the c-Jun or Id proteins. The identification of I, however, is less clear than that of A, and we shall thus focus our attention on the latter. The membrane of the muscle fiber transmits two types of signal: fast, propagating electrical impulses, and responses to "anterograde" factors released by motor nerve endings such as CGRP or ARIA. Electrical activity is known to repress the transcription of AChR subunit genes both in vivo and in vitro (Changeux 1991). Propagation of the electrical action potential along the fiber membrane causes the entry of Ca2+ ions that could possibly serve as an intracellular second messenger for AChR gene repression. Among the candidate regulatory molecules taking part directly or indirectly in this repression is the Ca2+ lipid-dependent protein kinase C, whose activation reduces AChR α-subunit mRNA levels (Klarsfeld et al. 1989). The model thus assumes that electrical activity, through one or several messenger systems of this type, reduces AChR gene transcription, and it does so through a reduction in activity of the factor A as defined above: as a consequence the autocatalytic effect of A is lowered, and the probability of going or staying in the high-A state is diminished. The propagation of this effect is taken as infinitely fast, and Ca2+ influx is viewed as occurring simultaneously at all points of the fiber. In contrast, the effect of trans-synaptic, anterograde factors such as CGRP is localized to the immediate vicinity of the neural release site. Another important difference is that while the effects of electrical activity on mRNA levels are sensitive to translation inhibitors (making it likely that they involve continuous synthesis of metabolically short-lived transcription factors), such is not the case for the action of CGRP, a finding consistent with the notion that these two phenomena involve distinct intracellular pathways and protein regulatory units (Duclert et al. 1990; Fontaine et al. 1987). At the cytoplasm level, the model requires some means for nuclei to communicate their state of activity to each other. We posit that this function is carried out by a diffusible substance or morphogen (Wolpert 1969). There is no definite experimental proof as yet for the existence of biological morphogens, that is, substances whose concentration gradients would be responsible for the differentiation of spatial regions during development. However, plausible candidates have been suggested (Eichele and Thaller 1987; Driever and Nüsslein-Volhard 1988a,b). In the present instance, myogenic factors of the MyoD family (Davis et al.
1987) might satisfy our theoretical requirements for a morphogen, since their metabolic lifetime is quite short, and since both they themselves and their mRNAs are diffusible substances that can ensure information transfer between nuclei. As the morphogen concentration must be both representative of the transcription state of, say, the AChR α-subunit gene (information transmission) and capable of modulating AChR α-subunit mRNA transcription, it appears legitimate to identify the morphogen with substance A (see above). Nuclei would thus establish communication (Blau et al. 1983) by transcribing a morphogen mRNA, leading to the synthesis of limiting amounts of morphogen in the cytoplasm. Cytoplasmic morphogen, in turn, would penetrate muscle nuclei through their nuclear membrane and bind to complementary DNA promoter elements, thereby becoming trapped while activating its own transcription. The elements of the model are summarized in Figure 1.

3 Formal Definition of the Model

We shall formalize the postulated genetic switches as a set of simple ON-OFF Boolean variables S_i with values 0 or 1 attached to each nucleus i.
Figure 1: The model. Three nuclei (dotted boxes) are depicted; from left to right: (i) Subsynaptic nucleus actively synthesizing AChR. The activator gene (a), coding for the diffusing morphogen (A), is active, enhancing its own transcription and that of AChR, while repressing that of the inhibitor. Neural factors ("CGRP") arrive from the presynaptic terminal as well, further enhancing AChR transcription through second messenger system(s). (ii) Non-subsynaptic nucleus. The inhibitor (I) is synthesized, turning off transcription of the activator and that of AChR. The amount of activator diffusing from other nuclei is not sufficient to overcome this effect. (iii) Silent subsynaptic nucleus. Neural factors alone are not sufficient to start AChR expression. Along the whole fiber, the autocatalytic action of the morphogen A is reduced by second messengers activated through electrical signaling.
By convention, S_i = 1 denotes a high rate of transcription for some AChR subunit gene (e.g., the α-subunit), and S_i = 0 a low rate. This type of mechanism may easily be generalized to other situations, for example, involving more genes and thus more variables per nucleus i linked by local enhancement-repression types of interaction (Thomas and D'Ari 1990). The switches interact with one another through competition for limiting amounts of morphogen diffusing (Blau et al. 1983) between them. As discussed above, the identification of a positive regulator (e.g., as some member of the MyoD family) is relatively secure, that of a negative regulator less so. Our computations being designed so as not to depend on details of the switch implementation or the exact chemical nature of the factors involved, we shall concentrate on the more essential positive regulator A. Two sets of equations must be written, one for the states S_i
Michel Kerszberg and Jean-Pierre Changeux
of the switches i, as controlled by the trapping of morphogen molecules in the nuclear region and at the promoter site(s), the other for the diffusion of the morphogen A itself. We assume that A has concentration-dependent probabilities of occupying the a promoter site(s), and that whenever a promoter is occupied by A, transcription takes place. The transition probabilities per time step for the switches i are thus

S_i(t) = 1:  S_i(t + 1) = 1 with probability p + αf(A_i)
             S_i(t + 1) = 0 with probability 1 − p − αf(A_i)
S_i(t) = 0:  S_i(t + 1) = 1 with probability 1 − ν + βf(A_i)
             S_i(t + 1) = 0 with probability ν − βf(A_i)     (3.1)
where f(A) denotes a threshold function, that is,

A < T ⇒ f(A) = 0
A ≥ T ⇒ f(A) = 1     (3.2)
while p, α, ν, and β are parameters obeying such restrictions as make the probabilistic interpretation of equations 3.1 possible. The probabilities of staying or going "ON" are both increased when A_i ≥ T. It is interesting to note that while we shall be interested exclusively in regimes leading to smooth, uniform behavior of S_i(t), values of p, ν, α, and β are not precluded that would yield oscillatory behavior of the activities, a situation that may be of interest in other experimental systems (Bargiello et al. 1984). Morphogen diffusion and synthesis are postulated to be described by the following equations for A_i:
A_i(t + 1) = c_b + A_i(t){τ_i − εE(t) − 2k[1 − σS_i(t)]}
             + k[1 − σS_{i−1}(t)]A_{i−1}(t) + k[1 − σS_{i+1}(t)]A_{i+1}(t)     (3.3)
Equations 3.1, 3.2, and 3.3 are the central expressions defining our model. In 3.3, c_b denotes the basal level of synthesis; k is the diffusion coefficient, or internuclear "hopping" probability per unit time, for A molecules. We neglect, in this simple model, spatial fluctuations in k: the nuclei are uniformly spaced, and the diffusion rates between them (assumed to be controlled mostly by the nuclear membrane barrier rather than by the internuclear distance, which may vary in the course of contraction) are the same for any two neighbors. Note that the equations must be modified for those nuclei closest to the tendinous ends of the fiber. It is important to realize that when the nuclear switch is turned "ON," only a fraction 1 − σ of the morphogen is available for diffusion: the rest is trapped in the nucleus, where it directs enhanced mRNA transcription (see below). This interpretation of 3.3 requires that the involved transcription
factors be rather scarce, which seems indeed to be the case for, say, the MyoD molecule. The A-enhanced transcription of the a gene yields fresh A product at a rate described by τ_i. The latter embodies the net effect of turnover (degradation) and autocatalytic biosynthesis. It depends on whether a nerve ending is located in the immediate vicinity of nucleus i and on the state of activity S_i of this nucleus:

isolated nucleus: τ_i = τ_0 + τ_n S_i
under a synaptic terminal: τ_i = τ_0 + τ_s + τ_n S_i     (3.4)
When the switch is "ON," synthesis proceeds at a faster pace, as described by τ_n. Transcription is also boosted further by anterograde factors such as CGRP, present under a terminal (τ_s) and acting there as "synaptotrophic" effectors. The latter are independent of the autocatalytic synthesis state S_i, and their efficacy need not be very strong (i.e., τ_s ≪ τ_n, τ_0).¹
On the other hand, we postulate that the depressing effect of electrical activity, E, results from a reduction of the net autocatalytic synthesis of A. Thus,

in the absence of electrical activity: E(t) = 0
ongoing electrical activity: E(t) = 1 − g^{n(t)}     (3.5)
ε is a measure of the extent of second-messenger activation by channel opening due to impulse arrival, and n(t) denotes the number of subsynaptic nuclei actively transcribing the a gene at time t. We assume n is indicative of the number of functional synaptic boutons; as one bouton is already sufficient to generate action potentials, equation 3.5 must rise sharply with n and saturate quickly at 1 (this will occur provided g is small).

4 Analytical Results for the Adult Endplate

We now calculate an approximate analytical solution of great interest, namely one where only a single focal nucleus per muscle fiber is actively expressing AChR genes. While the full problem is nonlinear, equations 3.3 are linear in A_i. Therefore, one can conveniently assume that diffusion of A is slow with

¹This can be viewed mathematically as the application of a so-called "symmetry-breaking field." Assuming all nuclei are equivalent in an infinite muscle fiber, a situation involving synthesis in only a subset of nuclei necessarily entails a breaking of the original equivalence. While the existence of symmetry-breaking solutions is an intrinsic property of the system, the choice of how exactly the equivalence is broken may be the result of an extremely weak "biasing effect" such as that exerted by our synaptotrophic field.
respect to the switch dynamics,² and that the switch is therefore at all times in a stationary situation constrained by fixed A.³ Accordingly, the focal solution, obtained by specifying that one given nucleus only is active (the question of its location will be discussed later), is determined by first solving the master equation corresponding to equation 3.1,⁴ in order to extract the probabilities of finding the switch i in the "ON" position when A_i is below (equation 4.3) or above (equation 4.4) threshold. We obtain

s_− = (1 − ν)/(2 − p − ν),   A_i < T     (4.3)

s_+ = (1 − ν + β)/(2 − p − ν − α + β),   A_i ≥ T     (4.4)
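As a quick sanity check, equations 4.3 and 4.4 can be verified numerically as the fixed points of the master equation of note 4. The sketch below is illustrative only (the function names are ours); it uses the switch parameters quoted later in the Figure 2 legend.

```python
def stationary_on_probability(p, alpha, nu, beta, f):
    """Fixed point of the master equation: the probability of finding a
    switch "ON" given the threshold-function value f in {0, 1}
    (equations 4.3 and 4.4)."""
    return (1 - nu + beta * f) / (2 - p - nu + (beta - alpha) * f)

def iterate_master(p, alpha, nu, beta, f, steps=500, P=0.5):
    """Relax P(t+1) = 1 - nu + beta*f + P(t)*[p + nu - 1 + (alpha - beta)*f]
    (equations 4.1 and 4.2) to equilibrium by direct iteration."""
    for _ in range(steps):
        P = 1 - nu + beta * f + P * (p + nu - 1 + (alpha - beta) * f)
    return P
```

With p = 0.10, α = 0.88, ν = 0.95, β = 0.80, the iteration settles on s_− ≈ 0.053 below threshold and s_+ ≈ 0.977 above it, so a switch is almost surely OFF at low morphogen concentration and almost surely ON at high concentration.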
Substituting these into 3.3, we get a slow-time effective equation for A_i, enabling us to compute quantities of interest. Chief among these is the diffusive decay length that describes, once the system has reached a stable state, the spatial variation of morphogen concentration around the nucleus where AChR gene expression takes place. This stable, time-independent concentration of A will be given by solving 3.3 with 4.4 substituted in the equation for A_0 (subscript 0 denoting the active site), and 4.3 in the others (A_i, i ≠ 0). Writing A(t + 1) = A(t) (time invariance), we find that the stabilized ratio of A_i to A_{i+1}, i ≠ 0, is

(A_i − A_∞)/(A_{i+1} − A_∞) = e^{1/λ},   2 cosh(1/λ) = 2 + [1 − τ(1)]/k     (4.5)

which defines the decay length λ, where τ(1) is the net synthesis rate at the nonactive sites and A_∞ is the concentration far away from the active nucleus, that is, A_∞ = c_b/[1 − τ(1)]. The single-active-nucleus solution can be further specified by computing the ratio A_0/A_1, not covered by
²The diffusional time scale is t_d ≈ 1/k_eff, where k_eff = k(1 − σ).
³While useful for analytical work, the biological validity of this so-called "quasi-stationary" assumption is far from proven. It is certainly correct as far as the time-independent solutions to be discussed below; as to dynamic phenomena, we have seen that similar results are obtained by computer simulation, whether the assumption is used or not.
⁴These equations are

P_i(t + 1) = 1 − ν + P_i(t)[p + ν − 1],   A_i < T     (4.1)

P_i(t + 1) = 1 − ν + β + P_i(t)[p + ν − 1 + α − β],   A_i ≥ T     (4.2)

where P_i(t) stands for the probability that S_i = 1 at time t. It is easy to ensure that either of these equations relaxes to equilibrium in a time much less than t_d (see note 3).
the previous formula.⁵ This defines the one-active-nucleus solution completely. However, for the solution to be consistent, it is imperative that A_0 be higher than the threshold T, and A_1 lower. Furthermore, T must be higher than A_∞. With parameter values as described in the legend of Figure 2, we find a decay length λ = 7.09, while A_0 = 2.97 and A_∞ = 0.40 (all these under electrical stimulation). The wavelength is such that around the isolated nucleus, on the order of 7 nuclei will be inactivated on each side. The precise value depends on the threshold T above which the morphogen becomes effective. Why are the nuclei inactivated? The morphogen is trapped by the promoter sites in the nucleus that it activates. Its mobility is thereby reduced, as well as its concentration in the vicinity: hence the low value of A_∞. The concentration far away from the single active nucleus is A_∞ = 0.52. Thus, by choosing a threshold T between A_∞ and A_0, one ensures that only one active nucleus may stably exist in the fiber. Direct dynamic simulations indeed confirm this expectation (see below), yet the exact position of the active nucleus appears difficult to assess analytically at this stage. In general, the value of A_0 when the active nucleus is at the tendon⁶ will be lower than otherwise: for the case at hand, A_0 becomes 2.38. This means the buildup of morphogen concentration required to stabilize the active nucleus occurs more readily at the tendinous end, where it has only one escape direction open. It is thus clear that the average transcription level will be rather high there. It must be stressed that λ has a significance that goes well beyond the strict framework of the single-active-site solution. It will more generally be related to the approximate wavelength or spatial periodicity of any stable structure evolving from the system dynamics.
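The decay length can be checked directly against the steady state of equation 3.3. The sketch below assumes the dispersion relation 2 cosh(1/λ) = 2 + [1 − τ(1)]/k implicit in equation 4.5, together with an illustrative net rate τ(1) = 0.998 at the inactive sites; the function names and the value of c_b are our own assumptions.

```python
import math

def decay_length(tau1, k):
    """Decay length lambda from 2*cosh(1/lambda) = 2 + (1 - tau1)/k,
    tau1 being the net synthesis rate at the inactive sites."""
    return 1.0 / math.acosh(1.0 + (1.0 - tau1) / (2.0 * k))

def far_concentration(c_b, tau1):
    """A_infinity = c_b / (1 - tau1), the level far from the active nucleus."""
    return c_b / (1.0 - tau1)

# With tau1 = 0.998 and k = 0.1 the decay length comes out close to 7,
# the order of magnitude quoted in the text.
```

One can also confirm numerically that the profile A_i = A_∞ + C e^{−i/λ} satisfies the steady-state recurrence A_i[1 − τ(1) + 2k] = c_b + k(A_{i−1} + A_{i+1}) at every inactive site, which is how λ enters equation 4.5.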
Indeed, nothing in what we have said until now excludes the appearance, farther than a few λ away from the active nucleus, of some additional activity pattern, in the form, for example, of another isolated active site. In general, periodic solutions may be expected with M active, N inactive alternating nuclei. The relative stability of those putative solutions is an extremely arduous problem to tackle analytically, so we shall presently turn to computer simulations.

⁵Morphogen concentrations at and in the immediate vicinity of the active site are given as the solutions of the following set of linear equations:

A_0[1 − τ(0) + Lk(1 − σ)] = c_b + LkA_1     (4.6)

A_1[1 − τ(1) + 2k] = c_b + k(1 − σ)A_0 + k[A_∞ + (A_1 − A_∞)e^{−1/λ}]     (4.7)

where τ(0) denotes the net synthesis rate at the active site.
In the first of these equations, L equals 2 when the active nucleus is in the midst of the fiber itself, or 1 when it is located at a tendinous end.
⁶That is, L = 1 in equations 4.6 and 4.7.
Figure 2: (a) Dynamics of transcription states in a focally innervated muscle fiber. States are sampled at intervals of 1000 computation steps (successive lines, from top to bottom). Initial innervation is random. Where and when transcription takes place a box is drawn, full for subsynaptic nuclei, empty for the others. Before the onset of electrical activity, nuclei are mostly active; in the presence of electrical inputs (arrow), transcription is generally repressed but persists in two nuclei, one subsynaptic and located near the center of the fiber, the other near a tendinous end. Note that this latter feature is not always present, although finer time-dependence analysis always shows repression to proceed slowest near the tendons and at the center. (The fiber is innervated by 10 randomly distributed terminals and contains 30 nuclei with genetic switches having α = 0.88, β = 0.80, p = 0.10, and ν = 0.95; σ = 0.9. Parameter g of equation 3.5 is 0.2, and morphogen diffusion is controlled by k = 0.1. Net synthesis rates (refer to eqs. 3.3 and 3.4) are ε = 0.002, τ_0 = 0.9997, τ_n = 0.00025, and τ_s = 0.00004.) (b) The effects of denervation. This is simulated here by switching off electrical input as well as anterograde signaling from the afferent motor endings at all times below the lower arrow. One observes "waves" of renewed transcriptional activity, spreading from the "central" nucleus, as well as from the tendinous ends (whether the nuclei there were initially active or not). Such a spread is typical of diffusion, and this figure thus summarizes a strong prediction of the model as a modified reaction-diffusion system.
5 Computer Simulations of Developmental Dynamics

Computer simulations of the system are consistent with the analytical figures wherever the latter are available, that is, in the case of stabilized focal innervation. Moreover, they open new perspectives into the behavior of the full nonlinear set of equations, which is out of the reach of analytical methods. Numerical calculations do not depend on the approximation used above, namely that the genetic switch adjustment is fast with respect to morphogen dynamics; yet they are compatible with our previous findings. They cover three cases: those of focal or multiple innervation, and that of noninnervated systems. We assume that 30 nuclei are present along the fiber, and that only a fraction of them are located under newly formed presynaptic terminals. Initially, 10 of those are distributed at random, each above one nucleus. This is a plausible description of the initial state in early endplate morphogenesis.
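The simulations of this section can be sketched compactly. The code below is a toy implementation under our reading of equations 3.1-3.5; the parameter values follow the Figure 2 legend, while c_b, the initial conditions, and all function names are assumptions of ours, so only qualitative tendencies should be read off it.

```python
import random

def simulate(n_nuclei=30, n_terminals=10, steps=5000, seed=0,
             p=0.10, alpha=0.88, nu=0.95, beta=0.80, T=2.50, g=0.2,
             k=0.1, sigma=0.9, c_b=0.0008, eps=0.002,
             tau0=0.9997, tau_n=0.00025, tau_s=0.00004,
             electrical=True):
    """Toy simulation of the endplate model (equations 3.1-3.5).

    c_b and the all-ON initial state are our own guesses, so only
    qualitative behavior should be read off the result."""
    rng = random.Random(seed)
    synaptic = set(rng.sample(range(n_nuclei), n_terminals))  # nuclei under a terminal
    S = [1] * n_nuclei      # transcription initially ON everywhere
    A = [1.0] * n_nuclei    # morphogen concentrations
    for _ in range(steps):
        n_on = sum(S[i] for i in synaptic)          # active subsynaptic nuclei
        E = (1 - g ** n_on) if electrical else 0.0  # equation 3.5
        f = [1 if a >= T else 0 for a in A]         # equation 3.2
        # equation 3.4: net synthesis rates
        tau = [tau0 + tau_n * S[i] + (tau_s if i in synaptic else 0.0)
               for i in range(n_nuclei)]
        # equation 3.3: synthesis, electrical depression, trapping, diffusion
        new_A = []
        for i in range(n_nuclei):
            nbrs = [j for j in (i - 1, i + 1) if 0 <= j < n_nuclei]
            a = c_b + A[i] * (tau[i] - eps * E - len(nbrs) * k * (1 - sigma * S[i]))
            a += sum(k * (1 - sigma * S[j]) * A[j] for j in nbrs)
            new_A.append(a)
        A = new_A
        # equation 3.1: stochastic switch update
        S = [1 if rng.random() < ((p + alpha * f[i]) if S[i]
                                  else (1 - nu + beta * f[i])) else 0
             for i in range(n_nuclei)]
    return S, A
```

Calling simulate(electrical=False) mimics the phase before the onset of activity; with electrical=True, the depression term εE tends to pull A below threshold and turn most switches off, in the spirit of Figure 2a.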
5.1 Focal Innervation. The upper part (a) of Figure 2 displays successive "snapshot" pictures of the state of activity of the 30 nuclei, at progressively later times from top to bottom. It can be seen how, with sustained electrical activity (which starts below the horizontal line), transcription in most of the nuclei is repressed. Analysis on a finer time scale shows transcription to diminish differently depending on position, with transcription around the fiber center and in nuclei situated near the tendons resisting longest. The final configuration usually comprises one active, subsynaptic nucleus located rather near the middle of the fiber (Salpeter and Loring 1985; Laufer and Changeux 1989; Changeux 1991). Persistent activity near the tendinous ends, while surprising at first, actually confirms the analytical calculations of the previous paragraph. When we take into account late growth of the myotube through myoblast fusion at its extremities (see Fig. 3), we find that with such growth, transcription may remain high in the nuclei incorporated latest in the syncytium, that is, those at the tendinous ends. Experimental support for this finding has been reported recently (Fontaine and Changeux 1989; Klarsfeld et al. 1991). Not pictured here, but observed in some cases, are anomalous patterns such as the persistence of two nuclei actively engaged in transcription, or the extinction of sustained transcription altogether.

5.2 Denervation. A situation of much interest concerns denervation. Experimental denervation usually leads to a reactivation of AChR gene transcription by many extrajunctional nuclei (Salpeter and Loring 1985; Laufer and Changeux 1989; Changeux 1991; Goldman and Staple 1989). We simulate denervation here by assuming that it simply causes the E variable (see equation 3.5) and the τ_s parameter to decrease to zero values.
Figure 3: Myotube growth through myoblast fusion at the extremities is simulated in this figure. Growth starts after an initial innervation and stabilization phase. The myoblasts fusing with the fiber are assumed to express the AChR α-subunit gene at a high rate. It is seen how, through diffusion of the activator, this state is lost more or less quickly (compare the two sides of the figure!). On one side, strong transcription-related labeling would be expected, as observed experimentally (see text).
Figure 2b displays the results of a computer run where "denervation" occurs after a period of electrical stimulation (i.e., at the second horizontal line from the top). It shows waves of transcription onset starting from "seed" nuclei at which concentration of A was initially high, and spreading progressively to the whole fiber. Such seed nuclei include the initially active (subsynaptic) ones, as well as the near-tendon units. Indeed, denervation experiments performed on adult muscle reveal a nonuniform reappearance of the receptor protein and mRNAs, which are first re-expressed in the neighborhood of the endplate (Goldman and Staple 1989; Salpeter and Loring 1985; Neville et al. 1991).
Figure 4: (a) Noninnervated muscle cultures. Spontaneous electrical activity is present in this system, and leads to multiple transcriptional foci. Note that strong lateral labeling by transcription markers is predicted. For τ_s = 0, ε = 0.0003, T = 2.50 (all other parameters unchanged) we find that clusters of transcribing nuclei appear in the form of doublets or transient triplets. The groups themselves are more or less regularly spaced. When one takes ε = 0.0006, four isolated, "regularly" spaced nuclei are found (not shown). (b) Dynamics of AChR gene expression in the multiply innervated muscle fiber. Here again, we assumed a reduced efficacy (ε = 0.00035) in the repressing effect of electrical activity, due, e.g., to spike "bunching." Under these conditions, more or less regularly spaced transcription sites persist, as indeed observed, e.g., in the chick anterior latissimus dorsi muscle. Notice that nuclei at the tendinous ends are invariably the site of intense transcription in this case. From parts (a) and (b) of the figure it is apparent that the final innervation pattern is determined to a large extent by the underlying sarcoplasmic morphogenesis interacting with a set of available exploratory synaptic boutons.
5.3 Cultured Fibers without Innervation. Figure 4a pertains to the situation where electrical activity occurs spontaneously, as is the case in primary cultures of chick embryonic muscle fibers (Fontaine and Changeux 1989; Harris et al. 1989; Bursztajn et al. 1989; Horovitz et al. 1989; Berman et al. 1990). This corresponds formally to τ_s = 0. The resulting structure may be called "imperfectly periodic": there is obviously a preferred spacing between active units, but this is not a strict constraint. Strings of "doublets" or even the transient existence of "triplets" of active nuclei are predicted. For different numerical settings (not shown), strings of isolated, or "singlet," active nuclei become apparent. More complex but less stable configurations than those described here might occur if a larger range of parameter values were to be explored. Interestingly, in situ hybridization experiments with cultured chick myotubes reveal, within the same fiber, silent nuclei alternating with others expressing AChR subunit mRNA, following a pattern of rather poor regularity (Fontaine and Changeux 1989; Harris et al. 1989; Bursztajn et al. 1989; Horovitz et al. 1989; Berman et al. 1990).

5.4 Multiple Innervation (ALD). In Figure 4b, we have selected, for the simulation, numerical parameters that yield a final pattern reminiscent of ALD multiple-focus innervation. One notices rather short-lived clusters of active, mostly sub- or near-synaptic nuclei. They are obviously reminiscent of the doublets seen in Figure 4a, and again display a "near-periodicity." It is quite apparent again how transcription proceeds actively at the tendons, which seem to "anchor" naturally the near-periodic pattern of active nuclei. These results suggest very strongly that, in our model, the underlying pattern of gene activation is the dominant factor controlling synapse stabilization. We see here an illustration of how a reduction in the average depressing effect of electrical activity may cause a change in the spatial pattern of genetic expression.
On the time scales considered, which are slow compared to nerve action potentials (or even interspike intervals), the reduction in ε leading to this change may be seen as originating, for instance, from a time patterning of electrical stimulation. Yet a detailed implementation of such time patterns in electrical stimulation, and of their effects on synapse morphogenesis, has not been attempted at this stage.
6 Outlook
Many models, reviewed recently (Van Essen 1982), have been introduced in order to explain the inhomogeneous distribution of AChR along the muscle fiber. None of these previous attempts, however, took account of the inhomogeneity already present at the level of transcription. Here we present a theoretical hypothesis whereby differential AChR gene expression, as controlled by scarce transcription factors for which nuclei enter into competition, may be one of the critical determinative events in motor endplate morphogenesis.
The model, which emphasizes the autonomous evolution of the muscle nuclei, leads to satisfactory agreement with a set of known experimental facts concerning the development of the motor endplate upon innervation. It may, however, be of broader import and apply, with suitable modifications, to a variety of situations involving morphogenesis (Hafen et al. 1984; Izpisúa-Belmonte et al. 1991). Crucial for the relevance to cellularized embryonic structures may well be the recent report (Joliot et al. 1991) that homeotic regulation factors might be able to cross membrane boundaries. The mechanisms proposed here apply to the case of syncytia; when individual, mononucleated cells (such as neurons) are present, one has to include in the model intercellular "morphogens," the mechanisms of their secretion (Torre and Steward 1992), of their recognition by membrane receptors, and of signal transduction. The model could, with suitable modifications, be generalized to synaptogenesis elsewhere in the peripheral and central nervous systems. Synaptogenesis through selective stabilization is of course taking place during development (Purves and Lichtman 1980) and has been reported in the cerebellum (Crépel et al. 1976; Mariani and Changeux 1980) and several areas of neocortex (Shatz et al. 1990), but has been observed directly in the living adult parasympathetic system as well (Purves and Voyvodic 1987). Recently, transient overproduction of neurotransmitter receptors has also been reported in diverse regions of the primate cerebral cortex (Lidow et al. 1991), and might lead to subsequent morphogenetic phenomena as modeled here if one considers a cortical column as a cellularized analog of the developing myotube. One of our findings that may be of significance for morphogenesis in general is the transcription onset waves we have observed on denervation.
The latter are characteristic of a diffusion process and may help substantiate rather simply the morphogen hypothesis in a variety of situations. In mathematical terms, the formulation introduced here represents both a simplification of the original morphogenesis formalization (Turing 1952; Meinhardt 1986) and its generalization. It is a mathematical simplification because the nonlinear aspects are limited to a set of points (the nuclei) rather than being spread over the whole space, thus rendering the model highly tractable analytically; conversely, it represents a generalization since it includes an interaction of the morphogen with the genetic machinery involved in morphogen synthesis itself. The basic ingredients are an autocatalytic loop through the enhancement of activator (morphogen) transcription by its own gene product, long-range inhibition through trapping of the activator by DNA elements in nuclei, thus reducing its availability; and even longer (infinite) range inhibition by electrical activity. The model clearly produces a number of predictions at the level of the molecular mechanisms of gene expression that can be experimentally tested.
References

Bargiello, T. A., Jackson, F. R., and Young, M. W. 1984. Restoration of circadian behavioural rhythms by gene transfer in Drosophila. Nature (London) 312, 752-754.
Berman, S. A., Bursztajn, S., Bowen, B., and Gilbert, W. 1990. Localization of an acetylcholine receptor intron to the nuclear membrane. Science 247, 212-214.
Blau, H., Chiu, C.-P., and Webster, C. 1983. Cytoplasmic activation of human nuclear genes in stable heterocaryons. Cell 32, 1171-1180.
Britten, R. J., and Davidson, E. H. 1969. Gene regulation for higher cells: A theory. Science 165, 349-356.
Bursztajn, S., Berman, S. A., and Gilbert, W. 1989. Differential expression of acetylcholine receptor mRNA in nuclei of cultured muscle cells. Proc. Natl. Acad. Sci. U.S.A. 86, 2928-2932.
Changeux, J. P. 1991. Compartmentalized transcription of acetylcholine receptor genes during motor endplate epigenesis. New Biologist 3, 413-429.
Changeux, J. P., Babinet, C., Bessereau, J. L., Bessis, A., Cartaud, A., Cartaud, J., Daubas, P., Devillers-Thiéry, A., Duclert, A., Hill, J. A., Jasmin, B., Klarsfeld, A., Laufer, R., Nghiem, H. O., Piette, J., Roa, M., and Salmon, A. M. 1990. Compartmentalization of acetylcholine receptor gene expression during development of the neuromuscular junction. Cold Spring Harbor Symp. Quant. Biol. LV, 381-396.
Crépel, F., Mariani, J., and Delhaye-Bouchaud, N. 1976. Evidence for a multiple innervation of Purkinje cells by climbing fibers in immature rat cerebellum. J. Neurobiol. 7, 567-578.
Davis, R. L., Weintraub, H., and Lassar, A. B. 1987. Expression of a single transfected cDNA converts fibroblasts to myoblasts. Cell 51, 987-1000.
Driever, W., and Nüsslein-Volhard, C. 1988a. A gradient of bicoid protein in Drosophila embryos. Cell 54, 83-93.
Driever, W., and Nüsslein-Volhard, C. 1988b. The bicoid protein determines position in the Drosophila embryo. Cell 54, 95-104.
Duclert, A., Piette, J., and Changeux, J. P. 1990. Induction of acetylcholine receptor α-subunit gene expression in chicken myotubes by electrical activity blockade requires ongoing protein synthesis. Proc. Natl. Acad. Sci. U.S.A. 87, 1391-1395.
Eichele, G., and Thaller, C. 1987. Characterization of concentration gradients of a morphogenetically active retinoic acid in the chick limb bud. J. Cell Biol. 105, 1917-1923.
Fontaine, B., Klarsfeld, A., Hökfelt, T., and Changeux, J. P. 1986. Calcitonin gene-related peptide, a peptide present in spinal cord motoneurons, increases the number of acetylcholine receptors in primary cultures of chick embryo myotubes. Neurosci. Lett. 71, 59-65.
Fontaine, B., Klarsfeld, A., and Changeux, J. P. 1987. Calcitonin gene-related peptide and muscle activity regulate acetylcholine receptor α-subunit mRNA levels by distinct intracellular pathways. J. Cell Biol. 105, 1337-1342.
Fontaine, B., Sassoon, D., Buckingham, M., and Changeux, J. P. 1988. Detection of the nicotinic acetylcholine receptor α-subunit mRNA by in situ hybridization at neuromuscular junctions of 15-day-old chick striated muscle. EMBO J. 7, 603-609.
Fontaine, B., and Changeux, J. P. 1989. Localization of nicotinic acetylcholine receptor α-subunit transcripts during myogenesis and motor endplate development in the chick. J. Cell Biol. 108, 1025-1037.
Goldman, D., and Staple, J. 1989. Spatial and temporal expression of acetylcholine receptor RNAs in innervated and denervated rat soleus muscle. Neuron 3, 219-228.
Hafen, E., Kuroiwa, A., and Gehring, W. J. 1984. Spatial distribution of transcripts from the segmentation gene fushi tarazu during Drosophila embryonic development. Cell 37, 833-841.
Harris, D. A., Falls, D. L., and Fischbach, G. D. 1989. Differential activation of myotube nuclei following exposure to an acetylcholine receptor-inducing factor. Nature (London) 337, 173-176.
Harris, D. A., Falls, D. L., Johnson, F. A., and Fischbach, G. D. 1991. A prion-like protein from chicken brain copurifies with an acetylcholine receptor-inducing activity. Proc. Natl. Acad. Sci. U.S.A. 88, 7664-7668.
Horovitz, O., Spitsberg, V., and Salpeter, M. M. 1989. Regulation of acetylcholine receptor synthesis at the level of translation in rat primary muscle cells. J. Cell Biol. 108, 1817.
Izpisúa-Belmonte, J. C., Tickle, C., Dollé, P., Wolpert, L., and Duboule, D. 1991. Expression of the homeobox Hox-4 genes and the specification of position in chick wing development. Nature (London) 350, 585-589.
Joliot, A. H., Triller, A., Volovitch, M., Pernelle, C., and Prochiantz, A. 1991. α-2,8-Polysialic acid is the neuronal surface receptor of Antennapedia homeobox peptide. New Biologist 3, 1121-1134.
Kauffman, S. A. 1986. Boolean systems, adaptive automata, evolution. In Disordered Systems and Biological Organization, E. Bienenstock, F. Fogelman-Soulié, and G. Weisbuch, eds., pp. 339-360. Plenum Press.
Klarsfeld, A., Laufer, R., Fontaine, B., Devillers-Thiéry, A., Dubreuil, C., and Changeux, J. P. 1989. Regulation of muscle AChR α-subunit expression by electrical activity: Involvement of protein kinase C and Ca²⁺.
Neuron 2, 1229-1236.
Klarsfeld, A., Bessereau, J. L., Salmon, A. M., Triller, A., Babinet, C., and Changeux, J. P. 1991. An acetylcholine receptor α-subunit promoter conferring preferential synaptic expression in muscle of transgenic mice. EMBO J. 10, 625-632.
Laufer, R., and Changeux, J. P. 1989. Activity-dependent regulation of gene expression in muscle and neuronal cells. Mol. Neurobiol. 3, 1-35.
Lidow, M. S., Goldman-Rakic, P. S., and Rakic, P. 1991. Synchronized overproduction of neurotransmitter receptors in diverse regions of the primate cerebral cortex. Proc. Natl. Acad. Sci. U.S.A. 88, 10218-10221.
Mariani, J., and Changeux, J. P. 1980. Multiple innervation of Purkinje cells by climbing fibers in the cerebellum of the adult staggerer mutant mouse. J. Neurobiol. 11, 41-50.
Meinhardt, H. 1986. Hierarchical inductions of cell states: A model for segmentation in Drosophila. J. Cell Sci. Suppl. 4, 357-381.
Monod, J., and Jacob, F. 1962. General conclusions: Teleonomic mechanisms in cellular metabolism. Cold Spring Harbor Symp. Quant. Biol. XXVI, 389.
Neville, C., Schmidt, M., and Schmidt, J. 1991. Kinetics of expression of ACh receptor α-subunit mRNA in denervated and stimulated muscle. NeuroReport 2, 655-657.
New, H. V., and Mudge, A. W. 1986. Calcitonin gene-related peptide regulates muscle acetylcholine receptor synthesis. Nature (London) 323, 809-811.
Purves, D., and Lichtman, J. 1980. Elimination of synapses in the developing nervous system. Science 210, 153-157.
Purves, D., and Voyvodic, J. 1987. Imaging mammalian nerve cells and their connections over time in living animals. Trends Neurosci. 10, 398-404.
Salpeter, M., and Loring, R. H. 1985. Nicotinic acetylcholine receptors in vertebrate muscle: Properties, distribution and neural control. Prog. Neurobiol. 25, 297-325.
Shatz, C. J., Gosh, A., McConnell, S. K., Allendoerfer, K. L., Friauf, E., and Antonini, A. 1990. Pioneer neurons and target selection in cerebral cortical development. Cold Spring Harbor Symp. Quant. Biol. LV, 469-480.
Thayer, M. J., Tapscott, S. J., Davis, R. L., Wright, W. E., Lassar, A. B., and Weintraub, H. 1989. Positive autoregulation of the myogenic determination gene MyoD1. Cell 58, 241-248.
Thomas, R., and D'Ari, R. 1990. Biological Feedback. CRC Press, Boca Raton, FL, and references therein.
Torre, E. R., and Steward, O. 1992. Demonstration of local protein synthesis within dendrites using a new cell culture system that permits the isolation of living axons and dendrites from their cell bodies. J. Neurosci. 12, 762-772.
Toutant, M., Bourgeois, J. P., Toutant, J. P., Renaud, D., Le Douarin, G. H., and Changeux, J. P. 1980. Chronic stimulation of the spinal cord in developing chick embryo causes the differentiation of multiple clusters of acetylcholine receptor in the posterior latissimus dorsi muscle. Dev. Biol. 76, 384-395.
Turing, A. M. 1952. The chemical basis of morphogenesis. Phil. Trans. R. Soc. (London) B 237, 37-72.
Van Essen, D. C. 1982. Neuromuscular synapse elimination: Review. In Neuronal Development, N. C. Spitzer, ed. Plenum Press, New York.
Wolpert, L. 1969. Positional information and the spatial pattern of cellular differentiation. J. Theor. Biol. 25, 1-47.
Received 13 May 1992; accepted 24 September 1992.
NOTE
Communicated by John Platt
Universal Approximation by Phase Series and Fixed-Weight Networks

Neil E. Cotter and Peter R. Conwell
Electrical Engineering Department, University of Utah, Salt Lake City, UT 84112 USA
In this note we show that weak (specified energy bound) universal approximation by neural networks is possible if variable synaptic weights are brought in as network inputs rather than being embedded in a network. We illustrate this idea with a Fourier series network that we transform into what we call a phase series network. The transformation only increases the number of neurons by a factor of two.

1 Technical Preliminaries
Let g(x) be the bounded measurable real-valued function that we wish to approximate. We take the domain of g to be the unit hypercube D = [-1/2, 1/2]^N, where N is the number of entries in x = (x_1, ..., x_N). To generalize our results below to an interval [-X/2, X/2] we would divide frequencies by X in our final phase series. By Lusin's theorem (Royden 1968), for any δ > 0 there exists a continuous function, f, such that the measure of the set where f is not equal to g is less than δ. Thus, by successfully approximating f we can restrict errors to an arbitrarily small percentage of inputs. We can represent f in turn with a Fourier series. We will assume that |f| is bounded by M/(2√2), where M is a value we must specify before constructing our network. This limits us to what we call "weak universal approximation." Since we may take M as large as desired, however, this is a mild limitation. For the derivation we assume M = 1. This amplitude bound translates into a total energy bound of 1/8 on domain D.

2 Phase Series Derivation
Since the energy of a sinusoid on D is one-half its squared amplitude, an energy bound of 1/8 translates into a bound of ±1/2 for the coefficients of a Fourier series for f(x):

f(x) = Σ_{δ=0,1} Σ_n a_{δn} cos(2π n·x + δπ/2)    (2.1)
Neural Computation 5, 359-362 (1993) @ 1993 Massachusetts Institute of Technology
where n = (n_1, ..., n_N) is a frequency vector with entries ranging over all possible combinations of positive and negative integer values, and δ is a binary variable used to obtain sine terms by phase-shifting cosine terms. We observe that the left-hand side of the following trigonometric identity has the same form as the summand in equation 2.1:

2 cos A cos B = cos(B − A) + cos(B + A)    (2.2)

We make the identification

a_{δn} = 2 cos A    (2.3)
Solving for A and substituting into equation 2.1 yields the phase series representation

f(x) = Σ_{δ=0,1} Σ_n Σ_{η=−1,1} cos(2π n·x + δπ/2 + η cos⁻¹[a_{δn}/2])    (2.4)
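The step from equation 2.1 to equation 2.4 rests entirely on identity 2.2: each term a_{δn} cos B becomes cos(B + A) + cos(B − A) with A = cos⁻¹(a_{δn}/2). A quick numerical sketch (the coefficient and phase values here are invented for illustration) confirms that the two forms agree term by term:

```python
import numpy as np

# One Fourier term a*cos(B) versus its two unit-coefficient phase terms,
# using A = arccos(a/2) so that a = 2*cos(A)  (equations 2.2-2.3).
rng = np.random.default_rng(0)
a = rng.uniform(-0.5, 0.5, size=100)      # coefficients bounded by +/- 1/2
B = rng.uniform(-np.pi, np.pi, size=100)  # arbitrary phases 2*pi*n.x + delta*pi/2
A = np.arccos(a / 2.0)

fourier_term = a * np.cos(B)
phase_terms = np.cos(B + A) + np.cos(B - A)   # the eta = +1 and eta = -1 terms

assert np.allclose(fourier_term, phase_terms)
print("max discrepancy:", np.abs(fourier_term - phase_terms).max())
```

Because the coefficients are bounded by ±1/2, the argument of arccos always lies in [−1/4, 1/4], so the phase is well defined.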
The phase series has several features we wish to highlight:

1. The phase and Fourier series have exactly the same value everywhere on D.
2. The coefficients of the cosine terms are unity regardless of f(x).
3. The coefficients, cos⁻¹[a_{δn}/2], specifying f(x) are processed in the same way as input data: both the coefficients and the data are part of a weighted sum.
4. The multipliers, n and η, of the weighted sum are independent of f(x).

3 Phase Series Neural Networks
We can implement both the original Fourier series in equation 2.1 and the phase series in equation 2.4 as neural networks having a single hidden layer and a linear output neuron. In the hidden layer, we could approximate the cosine term by summing familiar sigmoids such as the logistic squasher (Rumelhart and McClelland 1986) or the hyperbolic tangent (Hopfield 1984). For an exact representation, we sum copies of a "cosig" squasher (see Gallant and White 1988):

cos x = −1 + Σ_{k=−∞}^{∞} Σ_{χ=−1,1} cosig(χx − χ2πk + π/2)    (3.1)

where

cosig(x) = 0 for x ≤ −π/2;  [1 + cos(x + 3π/2)]/2 for −π/2 < x < π/2;  1 for x ≥ π/2    (3.2)
Figure 1: Phase series network based on the "cosig" sigmoid embedding one half-cycle of a cosine. Stacked boxes indicate a structure that is repeated for all values taken on by indices in the lower right-hand corner. The output sum thus has many inputs.
Substituting 3.1 into 2.4 yields the formula for a phase series neural network:

f(x) = Σ_{δ=0,1} Σ_n Σ_{η=−1,1} [ −1 + Σ_{k=−∞}^{∞} Σ_{χ=−1,1} cosig(χ2π n·x + χδπ/2 + χη cos⁻¹[a_{δn}/2] − χ2πk + π/2) ]    (3.3)
Figure 1 illustrates the phase series network. Note that the Fourier coefficients defining f(x) are embedded in the network inputs. Thus, we have derived a neural network, capable of universal approximation, in which the internal weights are fixed. Because the identity in equation 2.2 substitutes two cosine terms for one cosine term and a coefficient, this new network has only twice as many neurons as a Fourier network.

4 Discussion and Conclusion
There are possible advantages to using phase series in VLSI or optical circuits. First, we eliminate circuitry for accessing internal parameters. Second, more technologies are suitable for implementing the fixed synaptic weights inside the phase series network than are suitable for implementing varying synaptic weights inside a conventional neural network.
Third, constant parameters are less expensive and occupy less room than variable parameters in a circuit. A fixed resistance in a VLSI circuit or a fixed opacity in an optical circuit is relatively easy to manufacture.
Acknowledgments

The authors are greatly indebted to an anonymous reviewer who pointed out that the identity in equation 2.2 is simpler than the one we originally used.
References

Gallant, A. R., and White, H. 1988. There exists a neural network that does not make avoidable mistakes. Proc. Int. Joint Conf. Neural Networks (IJCNN), San Diego, CA, I, 657-664.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Royden, H. L. 1968. Real Analysis, 2nd ed. Macmillan, New York.
Rumelhart, D. E., and McClelland, J. L., eds. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. MIT Press, Cambridge, MA.

Received 18 May 1992; accepted 22 September 1992.
NOTE
Communicated by Halbert White
Backpropagation with Homotopy Liping Yang Wanzhen Yu Department of Applied Mathematics, Tsinghua University, 100084, Beijing, China
When training a feedforward neural network with backpropagation (Rumelhart et al. 1986), local minima are always a problem because of the nonlinearity of the system. There have been several ways to attack this problem: for example, restarting the training from a new initial point, or preprocessing the input data or the neural network. Here, we propose a computationally efficient method for avoiding some local minima. For a neural network, the output of every node is characterized by a nonlinear function (for example, the sigmoid function), which is the origin of local minima. Consider the following homotopy function

f_λ(x) = λx + (1 − λ)s(x)    (1)

where s(x) is a sigmoid function and λ ∈ [0,1]. f_λ(x) forms a homotopy between the linear and the sigmoid function. Denote by N_λ the neural network characterized by f_λ. We start the training with λ_0 = 1, that is, every node is linear. After achieving a minimum of N_{λ_k}, we choose λ_{k+1} < λ_k and continue the backpropagation procedure, until λ = 0, with which f_λ(x) is just the original sigmoid function. The learning of a feedforward network is to solve a nonlinear least-squares problem,
min F(w) = (1/2) Σ_{i=1}^{n} [g_i(w) − s_i]²    (2)
where n is the number of training samples and g_i is the output of the network whose weight vector is w. This problem can also be treated by solving
∇F(w) = 0  and  ∇²F(w) ≥ 0    (3)
The homotopy method for solving nonlinear equations has been studied since 1978 (Chow et al. 1978; Li 1987). This method begins the homotopy process with an easily solvable equation (at λ = 1) and traces the zero point of the homotopy function until reaching the solution of the original nonlinear equations (at λ = 0).

Neural Computation 5, 363-366 (1993) © 1993 Massachusetts Institute of Technology
As for the training of a neural network, it is known that F(w) has many local minima. Ordinary minimization algorithms often stop at local minima with large objective value, and it is not easy to get a satisfactory solution. The homotopy method can overcome this difficulty to some extent. Observe that the objective function F(w, λ) is a polynomial in w when λ = 1, and there exist very few minimum points. As λ decreases to 0, the nonlinearity of the objective function increases and more new minimum points appear. Because the minimum point w_k already achieved for F(w, λ_k) provides a relatively good initial point for minimizing F(w, λ_{k+1}), many unwanted local minima of F(w, λ_{k+1}) are avoided. When computing with the conventional BP method, some components of w may become so large that numerical instability arises. This is because a very large change of x causes only a very small change of s(x), especially when |x| itself is large; that is, s′(x) → 0 as |x| → ∞. The usual treatment in this case is to rescale w. The homotopy approach avoids the unbounded growth of w, because f_λ′(x) > λ. Our computational experiments show that the behavior of the homotopy method is good, although it does not guarantee the global minimum. In what follows, we study the process for decreasing λ. Assume that we have solved min F(w, λ) = F(w_0, λ) for a fixed λ ∈ (0,1]. Taking Δλ < 0, we need to solve min F(w, λ + Δλ), which implies
∇_w F(w, λ + Δλ)|_{w_0 + Δw} = 0    (4)

Expanding about (w_0, λ) and using ∇_w F(w_0, λ) = 0, this gives

0 = ∇_w F(w_0 + Δw, λ + Δλ) − ∇_w F(w_0, λ)    (5)

  ≈ ∇²_w F(w_0, λ) Δw + (∂/∂λ) ∇_w F(w_0, λ) Δλ    (6)

Whenever ∇²_w F(w_0, λ) is nonsingular, we can compute Δw by

Δw = −[∇²_w F(w_0, λ)]⁻¹ (∂/∂λ) ∇_w F(w_0, λ) Δλ    (7)
In practical computation, w_0 + Δw can serve as a prediction of the minimum point of F(w, λ + Δλ). Table 1 illustrates an example of a separation problem. This problem contains 21 input points with corresponding outputs equal to +1 or −1. We use a neural network that has one hidden layer with 12 nodes. The sigmoid function s(x) is taken as 2/(1 + e^{−x}) − 1. For this problem, the conventional BP algorithm fails to arrive at a totally correct solution, although many initial points have been tested. The homotopy method achieves a minimum point after 5172 iterations, which separates the input points correctly. The result is shown in Figure 1, where the white region indicates positive output of the neural network and the black region indicates negative output.
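The λ-schedule described above can be sketched in a few lines. This is a minimal illustration only: the network size, the XOR-like data, the learning rate, and the particular λ values are invented for the sketch and are not the 12-node network or the 21-point problem of Table 1. Backpropagation runs at each fixed λ, using the note's sigmoid s(x) = 2/(1 + e^{−x}) − 1 and the homotopy activation of equation (1):

```python
import numpy as np

def s(x):                       # the note's sigmoid: 2/(1+e^-x) - 1
    return 2.0 / (1.0 + np.exp(-x)) - 1.0

def s_prime(x):
    v = s(x)
    return 0.5 * (1.0 - v * v)

def f(x, lam):                  # homotopy activation, equation (1)
    return lam * x + (1.0 - lam) * s(x)

def f_prime(x, lam):            # f'(x) >= lam, so gradients never vanish
    return lam + (1.0 - lam) * s_prime(x)

def mse(out, y):                # objective (2)
    return 0.5 * np.sum((out - y) ** 2)

# Toy separation problem (hypothetical data, targets +/- 1).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[-1.], [1.], [1.], [-1.]])

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)

def forward(lam):
    a1 = X @ W1 + b1; h = f(a1, lam)
    a2 = h @ W2 + b2
    return a1, h, a2, f(a2, lam)

loss_start = mse(forward(0.0)[3], y)      # initial loss of the lam = 0 network

for lam in (1.0, 0.75, 0.5, 0.25, 0.0):   # decrease lambda step by step
    for _ in range(2000):                 # backpropagation at fixed lambda
        a1, h, a2, out = forward(lam)
        d2 = (out - y) * f_prime(a2, lam)
        d1 = (d2 @ W2.T) * f_prime(a1, lam)
        W2 -= 0.05 * h.T @ d2; b2 -= 0.05 * d2.sum(0)
        W1 -= 0.05 * X.T @ d1; b1 -= 0.05 * d1.sum(0)

loss_end = mse(forward(0.0)[3], y)
print(loss_start, "->", loss_end)         # loss of the final sigmoid network drops
```

A predictor step in the spirit of equation (7) could replace the plain warm start by extrapolating the weights along −[∇²F]⁻¹(∂/∂λ)∇F · Δλ before each inner minimization.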
Table 1: Separation Problem

Input          Output    Input          Output    Input          Output
(0.05, 0.05)   +1        (0.50, 0.05)   +1        (0.05, 0.30)   -1
(0.95, 0.95)   +1        (0.50, 0.95)   +1        (0.05, 0.70)   -1
(0.05, 0.95)   +1        (0.95, 0.30)   -1        (0.40, 0.50)   +1
(0.95, 0.05)   +1        (0.30, 0.05)   -1        (0.60, 0.50)   +1
(0.50, 0.50)   -1        (0.70, 0.05)   -1        (0.50, 0.40)   +1
(0.95, 0.50)   +1        (0.70, 0.95)   -1        (0.50, 0.60)   +1
(0.05, 0.50)   +1        (0.30, 0.95)   -1        (0.95, 0.70)   -1
Figure 1: Result of the homotopy method.
Our computational experiments were made on about 20 separation problems of sizes similar to the illustrated one. The initial weights of the neural networks were fixed for all problems. The conventional BP algorithm fails to get a correct separation for one-third of the tested problems, especially for problems in which the input data are irregular. The homotopy method arrives at a correct separation for all problems except one. It should be noted that the homotopy method usually takes more iterations than the conventional BP algorithm, because the former solves many minimization problems step by step as λ_k decreases. This is a problem that needs further study.

References

Chow, S. N., Mallet-Paret, J., and Yorke, J. 1978. Finding zeros of maps: Homotopy methods that are constructive with probability one. Math. Comp. 32, 887-889.
Li, T. Y. 1987. Solving polynomial systems. The Math. Intelligencer 9, 33-39.
Orfanidis, S. J. 1990. Gram-Schmidt neural nets. Neural Comp. 2, 116-126.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1, D. E. Rumelhart and J. L. McClelland, eds. MIT Press, Cambridge, MA.
Weymaere, N., and Martens, J. P. 1991. A fast and robust learning algorithm for feedforward neural networks. Neural Networks 4, 361-369.
Received 26 June 1992; accepted 29 September 1992.
NOTE
Communicated by Halbert White
Improving Rejection Performance on Handwritten Digits by Training with "Rubbish" Jane Bromley John S. Denker AT&T Bell Laboratories, Holmdel, NJ 07733, USA
Introduction

Very good performance for the classification of handwritten digits has been achieved using feedforward backpropagation networks (LeCun et al. 1990; Martin and Pittman 1990). These initial networks were trained and tested on clean, well-segmented images. In the real world, however, images are rarely perfect, which causes problems. For example, at one time one of our best performing digit classifiers interpreted a horizontal bar as a 2; in this example the most useful response would be to reject the image as unclassifiable. The aim of the work reported here was to train a network to reject the type of unclassifiable images ("rubbish") typically produced by an automatic segmenter for strings of digits (e.g., zip codes), while maintaining its performance level at classifying digits, by adding images of rubbish to the training set.
Solution to the Problem

Our data consisted of 39,740 handwritten characters, obtained from automatically segmented zip codes. The segmentation process used a number of heuristic algorithms that selected the best vertical cuts through the zip codes.¹ Since the cuts were vertical and the heuristics imperfect, many of the zip codes were poorly segmented, resulting in 94% images of single digits and 6% images of rubbish. An example segmentation is shown in Figure 1. There are, of course, innumerable other patterns that are not good digits, but the rubbish created in this way was particularly relevant to our task. These data were used to train two networks: GOOD
¹An improved zip code reader has since been developed, which relies less on heuristics and for which the training and segmenting are even more strongly coupled (Burges et al. 1992).
Neural Computation 5,367-370 (1993) @ 1993 Massachusetts Institute of Technology
Figure 1: A typical zip code showing, with dotted lines, one possible, but incorrect, segmentation. We designated as "rubbish" any subimage that could not be identified as a single digit because (1) there was more than one digit present, (2) there was only a small part of a digit present, or (3) the image was not of a digit. All images were labeled by hand. During training, the desired output vector for rubbish images was chosen to have all neurons low, while for a digit the corresponding neuron was high, the rest low.
was trained on 27,359 images of good digits, and GOOD+RUBBISH was trained on these plus an extra 1642 images of rubbish. The architecture and training (using backpropagation) of these networks are described by LeCun et al. (1990). In Table 1 we see that training on rubbish caused no degradation when the nets were tested on good digits only, while it distinctly improved the ability of the network to reject rubbish: the network went from rejecting 29.4 to 20.9% of the test patterns for 1% error on the test set. The rejection criterion was based on the difference between the two highest network outputs, with the highest confidence being assigned to classifications with the largest difference between these two outputs. An experimental investigation of rejection criteria for this network architecture has been made, and this criterion came out as about the best (private communication from Yann LeCun). (No improvement is to be expected in the raw error rate, since the rubbish digits were scored as errors for both networks in this case.) The MSE showed no significant variability.
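The rejection rule can be sketched as follows (a minimal illustration; the function names and the synthetic output scores are our own, not from the note). Confidence is the gap between the two highest outputs, and patterns are rejected from the least confident upward until the error rate on the accepted patterns meets the target:

```python
import numpy as np

def confidence(outputs):
    """Per-pattern confidence: gap between the two highest network outputs."""
    top2 = np.sort(outputs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def reject_rate(outputs, labels, target_error=0.01):
    """Fraction of patterns to reject (least confident first) so that the
    error rate on the accepted patterns is at most target_error."""
    pred = outputs.argmax(axis=1)
    order = np.argsort(-confidence(outputs))          # most confident first
    errors = np.cumsum(pred[order] != labels[order])  # errors among top-k kept
    kept = np.arange(1, len(order) + 1)
    ok = errors / kept <= target_error
    n_keep = kept[ok].max() if ok.any() else 0
    return 1.0 - n_keep / len(order)

# Invented two-class scores: the two misclassified patterns have the
# smallest top-two gap, so they are the first to be rejected.
outputs = np.array([[0.90, 0.10],
                    [0.80, 0.20],
                    [0.60, 0.40],
                    [0.45, 0.55]])
labels = np.array([0, 0, 1, 0])
print(reject_rate(outputs, labels))  # -> 0.5
```

With this criterion, ambiguous inputs (including rubbish trained toward an all-low output vector) naturally end up with a small top-two gap and are rejected first.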
Table 1: Comparison of the Performance of the Two Networks after 20 Passes Through Their Respective Training Sets.*

                       Tested on Good            Tested on Good+Rubbish
Trained on:         GOOD    GOOD+RUBBISH       GOOD    GOOD+RUBBISH
MSE                 .026    .027               .031    .029
Error rate (%)      4.3     4.4                10.3    10.4
Reject (%)          8.5     8.3                29.4    20.9

*Each network was tested on two different testing sets: one consisting of only good digits and the other containing 6% rubbish. MSE is the analog mean square error between desired and actual outputs. Reject is the percentage of patterns that had to be rejected (by the network) to achieve a 1% error rate on the remaining test digits.
Conclusions

- Neural networks only do what they are trained to do.
- Contrary to other findings (Lee 1991), neural networks can generate an effective confidence judgment for rejecting ambiguous inputs.
- Rubbish subimages are ambiguous digits, multiple digits, partial digits, and noise. These are common in real-world images of handwritten digits. Our results show that including rubbish images in the training set improves the performance of a neural network digit classifier at rejecting such patterns. Performance on well-segmented digits is unaffected by this extra training.
- Accurate rejection is crucial in a system that automatically segments multidigit images and relies on its classifier to accept or reject possible segmentations of the image (Matan et al. 1992).
- Using this classifier in such a system led to an improvement in the recognition rate per zip code from 69 to 78%. Even more importantly, there was a vast improvement in the rejection performance of the whole system. Zip codes were sorted according to the neural network confidence measure of their being correct. Prior to training on rubbish, when the first 60% correctly classified zip codes were accepted, 10% erroneously classified zip codes were also accepted, while after training only 3% erroneously classified zip codes were included.
Acknowledgments

Support of this work by the Technology Resource Department of the U.S. Postal Service under contract number 104230-90-C-2456 is gratefully acknowledged.
References

Burges, C. J. C., Matan, O., LeCun, Y., Denker, J. S., Jackel, L. D., Stenard, C. E., Nohl, C. R., and Ben, J. I. 1992. Shortest path segmentation: A method for training a neural network to recognize character strings. Proc. Intl. Joint Conf. Neural Networks, IEEE, 3, 165-172.
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1990. Handwritten digit recognition with a backpropagation network. In Neural Information Processing Systems, D. S. Touretzky, ed., pp. 396-404. Morgan Kaufmann, San Mateo, CA.
Lee, Y. 1991. Handwritten digit recognition using K nearest-neighbor, radial-basis function, and backpropagation neural networks. Neural Comp. 3, 440-449.
Matan, O., Bromley, J., Burges, C. J. C., Denker, J. S., Jackel, L. D., LeCun, Y., Pednault, E. P. E., Satterfield, W. D., Stenard, C. E., and Thompson, T. J. 1992. Reading handwritten digits: A zip code recognition system. IEEE Computer 25, 59-63.
Martin, G. L., and Pittman, J. A. 1990. Recognizing hand-printed letters and digits. In Neural Information Processing Systems, D. S. Touretzky, ed., pp. 405-414. Morgan Kaufmann, San Mateo, CA.

Received 27 July 1992; accepted 20 October 1992.
NOTE
Communicated by Eric Baum
Vapnik-Chervonenkis Dimension Bounds for Two- and Three-Layer Networks Peter L. Bartlett¹ Department of Electrical and Computer Engineering, University of Queensland, Qld, 4072, Australia

We show that the Vapnik-Chervonenkis dimension of the class of functions that can be computed by arbitrary two-layer or some completely connected three-layer threshold networks with real inputs is at least linear in the number of weights in the network. In Valiant's "probably approximately correct" learning framework, this implies that the number of random training examples necessary for learning in these networks is at least linear in the number of weights.

This note addresses the question, How many training examples are necessary for satisfactory learning performance in a multilayer feedforward neural network used for classification? To define "satisfactory learning performance," we assume that the examples are generated randomly, and say that the trained network is approximately correct if it correctly classifies a random example with high probability. We require that the trained network will almost always be approximately correct, for any desired target function and any probability distribution of examples. This is known as "probably approximately correct" (or pac) learning (Valiant 1984). Blumer et al. (1989) show that the number of labelled examples necessary and sufficient for pac learning depends linearly on the Vapnik-Chervonenkis dimension (VC-dimension) of the set of functions that the learner can choose from.

Definition 1. A class F of {0,1}-valued functions defined on a set X is said to shatter a finite set S ⊆ X if, for each of the 2^{|S|} classifications of the points in S, there is a function in F that computes the classification. The VC-dimension of F [written VCdim(F)] is the size of the largest subset of X that F shatters.

We consider networks of processing units in layered, feedforward architectures with real-valued inputs and a single binary output.
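Definition 1 can be made concrete with a small brute-force check (an illustrative sketch; the point set, the perceptron search, and the names below are our own, not from the note). A single threshold unit on R², which has three adjustable parameters (two weights and one threshold), shatters a set of three points in general position, since each of the 2³ labelings is linearly separable and the perceptron algorithm finds a realizing parameter setting:

```python
import itertools
import numpy as np

def threshold_unit(w, theta, x):
    """Linear threshold unit: H(w.x - theta), with H(a) = 1 iff a >= 0."""
    return 1 if np.dot(w, x) - theta >= 0 else 0

def find_separator(points, labels, epochs=1000):
    """Perceptron search for (w, theta) realizing the labeling; the
    perceptron converges whenever the labeling is linearly separable."""
    X = np.hstack([points, np.ones((len(points), 1))])   # augment with bias
    t = np.where(np.asarray(labels) == 1, 1.0, -1.0)
    v = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for xi, ti in zip(X, t):
            if ti * np.dot(v, xi) <= 0:
                v = v + ti * xi
                mistakes += 1
        if mistakes == 0:
            return v[:-1], -v[-1]                        # (w, theta)
    return None

# Three points in general position in R^2: all 2^3 = 8 labelings are
# realizable, so one threshold unit (3 parameters) shatters this set.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
shattered = all(find_separator(points, lab) is not None
                for lab in itertools.product([0, 1], repeat=3))
print(shattered)  # -> True
```

No set of four points can be shattered by a single unit on R² (e.g., the XOR labeling of the unit square is not linearly separable), matching the theme that the VC-dimension grows with the number of adjustable parameters.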
A feedforward architecture is a set of units (input units and processing units) arranged in a number of layers, and a set of connections, each of which joins

¹Current address: Department of Systems Engineering, RSPhysSE, Australian National University, 0200, Australia.
Neural Computation 5,371-373 (1993) @ 1993 Massachusetts Institute of Technology
one unit to another unit in a later layer. An L-layer network contains L layers of processing units. A feedforward threshold network consists of a feedforward architecture that has a particular real-valued weight and threshold associated with each connection and processing unit, respectively. Each processing unit in the network computes a linear threshold function, f(x) = H(Σ_i x_i w_i − θ), where x_i, w_i, and θ are the real-valued inputs to the unit, weights, and threshold, respectively, and H(a) is 1 if a ≥ 0 and 0 otherwise. Notice that a network consists of an architecture together with the weights and thresholds, so it computes a particular {0,1}-valued function of its inputs. We refer to the VC-dimension of the class of functions that can be computed by threshold networks with a particular feedforward architecture A as the VC-dimension of that architecture, and write VCdim(A). The VC-dimension of an arbitrary feedforward architecture is not known precisely. Baum and Haussler (1989) show that the VC-dimension of a feedforward architecture with N processing units and W weights is no more than 2W log₂(eN) (where e is the base of the natural logarithm), and that the VC-dimension of a completely connected two-layer architecture with k₀ input units and k₁ first-layer units is at least 2⌊k₁/2⌋k₀. (A completely connected multilayer network has connections between all pairs of units in adjacent layers.) In this note, we give lower bounds on the VC-dimension of arbitrary two-layer architectures and some completely connected three-layer architectures. By the results of Blumer et al. (1989), the bounds indicate in all cases that the sample size necessary for pac learning is at least proportional to W, the number of weights in the network. Proofs of the results are given in the full version of this note (Bartlett 1992). To show that the VC-dimension of an architecture is at least d, we can construct a shattered set of size d. The problem of constructing such a set can be decomposed by separately constructing defining sets for units in the network.
Definition 2. A set S = {x₁, x₂, ..., x_m} ⊂ R^{k₀} is a defining set for a unit u in a feedforward threshold network with k₀ real-valued inputs if

1. We can classify the points in S in each of the 2^{|S|} distinct ways by slightly perturbing the weights and threshold of unit u.

2. Slightly perturbing the weights and thresholds of units other than u will not affect the classification of the points in S.

A point x ∈ R^{k₀} is an oblivious point for this network if the classification of x is unaffected by sufficiently small perturbations of the network weights.
Theorem 3. Let A be a feedforward architecture. Consider a set of processing units U in this architecture and a threshold network N with architecture A that has an oblivious point. If there is a finite defining set S_u for each unit u in U, then VCdim(A) ≥ Σ_{u∈U} |S_u| + 1.
By finding appropriate defining sets, we can use Theorem 3 to give the following lower bounds.
Theorem 4. Let A be an arbitrary two-layer feedforward architecture. If A has I connections from the input units to other units, then VCdim(A) ≥ I + 1.

Theorem 5. Let A be the three-layer, completely connected architecture with k₀ > 0 input units, k₁ > 0 first-layer units, k₂ > 0 second-layer units, and a single output unit.

(a) If k₀ ≥ k₁, and k₂ ≤ 2^{k₁}/(k₁²/2 + k₁/2 + 1), then VCdim(A) ≥ k₀k₁ + k₁(k₂ − 1) + 1.

(b) If 1 < k₀ < k₁ and k₁ ≥ k₂, then VCdim(A) ≥ k₀k₁ + k₁(k₂ − 1)/2 + 1.
These results imply that for learning two-layer networks or completely connected three-layer networks with k₂ not too large, the sample size must increase at least linearly with the number of weights. The results also give lower bounds for learning in networks of processing units with a sigmoid transfer function, since a sigmoid network can compute any function on a finite set that a threshold network can compute.
Acknowledgments This work was supported by the Australian Telecommunications and Electronics Research Board. Thanks to T. Downs, R. Lister, D. Lovell, R. Williamson, and S. Young for comments on a draft.
References

Bartlett, P. L. 1992. Lower bounds on the Vapnik-Chervonenkis dimension of multi-layer threshold networks. Tech. Rep. IML92/3, Intelligent Machines Laboratory, Department of Electrical and Computer Engineering, University of Queensland, Brisbane, Australia, October 1992.
Baum, E. B., and Haussler, D. 1989. What size net gives valid generalization? Neural Comp. 1, 151-160.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Computing Machin. 36(4), 929-965.
Valiant, L. G. 1984. A theory of the learnable. Commun. ACM 27(11), 1134-1143.

Received 24 September 1992; accepted 29 September 1992.
Communicated by Richard Andersen
A Neural Network for the Processing of Optic Flow from Ego-Motion in Man and Higher Mammals Markus Lappe Josef P. Rauschecker Laboratory of Neurophysiology, NIMH, Poolesville, MD 20837, USA, and Max Planck Institute for Biological Cybernetics, Tübingen, Germany
Interest in the processing of optic flow has increased recently in both the neurophysiological and the psychophysical communities. We have designed a neural network model of the visual motion pathway in higher mammals that detects the direction of heading from optic flow. The model is a neural implementation of the subspace algorithm introduced by Heeger and Jepson (1990). We have tested the network in simulations that are closely related to psychophysical and neurophysiological experiments and show that our results are consistent with recent data from both fields. The network reproduces some key properties of human ego-motion perception. At the same time, it produces neurons that are selective for different components of ego-motion flow fields, such as expansions and rotations. These properties are reminiscent of a subclass of neurons in cortical area MSTd, the triple-component neurons. We propose that the output of such neurons could be used to generate a computational map of heading directions in or beyond MST.
1 Introduction
The concept that optic flow is important for visual navigation dates from the work of Gibson in the 1950s. Gibson (1950) showed that the optic flow pattern experienced by an observer moving along a straight line through a static environment contains a singularity that he termed the focus of expansion. He hypothesized that the visual system might use the global pattern of radial outflow originating from this singularity to determine the translational heading of the observer. A host of studies in human psychophysics have followed up Gibson's ideas (Regan and Beverly 1982; Rieger and Toet 1985; Warren et al. 1988; Warren and Hannon 1988, 1990). Regan and Beverly (1982) rejected his hypothesis on the basis that the optic flow pattern that arrives on the

Neural Computation 5, 374-391 (1993) © 1993 Massachusetts Institute of Technology
retina is radically altered by eye movements of the observer. Then the flow field becomes a superposition of the radial outflow pattern with a circular flow field that is obtained when the eyes move in the orbita. Generally the resulting vector field may also have a singular point similar to a focus of expansion, but this point does not necessarily coincide with the heading direction. If, for instance, the eye rotation results from the fixation of a point in the environment, the singularity will be at the fixation point instead of the destination point. Nevertheless, Warren and Hannon (1990) found humans capable of judging their heading with great accuracy from optic flow patterns that simulated translation plus eye rotation. Their subjects were able to perceive their heading with a mean error between one and two degrees solely from the optic flow. No nonvisual information such as oculomotor signals was necessary. This ability persisted over a natural range of speeds and over a variation of the number of visible moving points between 10 and several hundred. The performance of the subjects was at chance, however, when no depth information in the form of motion parallax was available. In the visual system there are at least two (maybe three) main streams of information flow (Mishkin etal. 1983; Livingstone and Hubel 1988; Zeki and Shipp 1988). In the simplest depiction, there is an inferotemporal system that is mainly responsible for the processing of form, and a parietal system that processes motion (Ungerleider and Mishkin 1982). Within the cortical motion system, one of the prominent and most investigated areas in primates is the middle temporal area or area MT (Allman and Kaas 1971). In cats the probable homologue for MT is the Clare-Bishop area (Clare and Bishop 1954), also called area PMLS (Palmer et al. 1978). Evidence from both areas suggests that they participate in the processing of flow field information. 
Both areas contain neurons that are highly direction selective and respond well to moving stimuli. It was first found in cat area PMLS that a majority of neurons prefer movement away from the area centralis, that is, centrifugal motion (Rauschecker et al. 1987a,b; Brenner and Rauschecker 1990). The same has been found in monkey area MT (Albright 1989), thus strengthening the likelihood of a homology between these two areas. Other studies have revealed single neurons in PMLS that respond well to approaching or receding objects (Toyama et al. 1990). More recently, a number of studies have described neurons in the dorsal part of monkey area MST (MSTd) that respond best to large expanding/contracting, rotating, or shifting patterns (Tanaka and Saito 1989a,b; Andersen et al. 1990; Duffy and Wurtz 1991a,b). The response of these neurons often shows a substantial invariance to the position of the stimulus. Duffy and Wurtz (1991a,b) found that a majority of the neurons in MSTd responded not only to one component of motion of the stimulus pattern (e.g., expansion or contraction), but rather to two or all three of them separately. About one-third of MSTd cells displayed selectivity to
Markus Lappe and Josef P. Rauschecker
expansions or contractions and clockwise or counterclockwise rotations and showed broad directional tuning for shifting dot patterns when tested with these stimuli one after another. It is these "triple component cells" that our model is mainly concerned with. Furthermore, cells in MSTd are unselective for the overall speed of a stimulus and for the amount of depth information available in the stimulus. There have been a number of computational approaches to extract navigational information from optic flow, focusing on different mathematical properties of the flow field. The difficulty of the task is that in the mapping of three-dimensional movements onto a two-dimensional retina some information is lost that cannot be fully recovered. Models that use differential invariants (Koenderink and van Doorn 1981; Longuet-Higgins and Prazdny 1980; Waxman and Ullman 1985) require dense optic flow to compute derivatives. By contrast, humans are quite successful with sparse fields (Warren and Hannon 1990). Models based on algorithms that solve a set of equations for only a small number of vectors (Prazdny 1980; Tsai and Huang 1984), on the other hand, require precise measurements and are very sensitive to noise. Methods that rely on motion parallax or local differential motion (Longuet-Higgins and Prazdny 1980; Rieger and Lawton 1985) are in agreement with the psychophysical data in that they fail in the absence of depth in the environment. However, they require accurate measurements at points that are close to each other in the image but are separated in depth, which is an especially difficult task to accomplish. Furthermore, recent psychophysical studies (Stone and Perrone 1991) have shown that local depth variations are not necessary. Least-squares minimization algorithms (Bruss and Horn 1983; Heeger and Jepson 1990) that use redundant information from as many flow vectors as are available are robust and comparatively insensitive to noise.
None of the above-mentioned algorithms is clearly specified in terms of a neural model. Given the current advances in visual neurophysiology, it seems desirable to construct a neural network for ego-motion perception that is consistent with the neurophysiological and psychophysical data. Recently a network model of heading perception in the simpler case without eye movements has been described (Hatsopoulos and Warren 1991), which accounts for some psychophysical findings. A neural model that we presented earlier in brief form, together with first results from the model described in this paper (Lappe and Rauschecker 1991), is also concerned with pure translations. It uses a centrifugal bias similar to the one found in PMLS and MT to achieve precise heading judgments with neuronal elements that are as broadly directionally tuned as the cells found in these areas. In this article we present a new neural network that succeeds when the radial flow pattern is disturbed by eye movements. The network is capable of reproducing many of the psychophysical findings, and the
single units exhibit great similarity to the triple component cells of Duffy and Wurtz (1991a,b) in area MSTd.

2 The Model
Our network is built in two layers. The first layer is designed after monkey area MT and represents the input to the network. The second layer is constructed to yield a representation of the heading direction as the output of the net and thus could form a model of MSTd. In each network layer we employ a population encoding of the relevant variables, namely the speed and direction of local movements in layer one and the heading direction of the individual in layer two. The computation of the direction of translation is based on the subspace algorithm by Heeger and Jepson (1990). Its main course of action is to eliminate the dependencies on depth and rotation first and thereby gain an equation that depends only on the translational velocity. Therefore it bears some similarity to Gibson's original claim that the visual system can decompose the optic flow into its translational and rotational components. We will restrict the scope of our model to such eye movements as occur when the observer keeps his eyes fixed on a point in the environment while he is moving. This is a natural and frequently occurring behavior, and we believe that using assumptions that reflect the behavior of an animal or a human being makes it more likely to obtain results that can be compared with experimental data. Although it is mathematically possible to include any type of eye movement, it is not very likely that the eyes would rotate around their long axis to a significant extent during locomotion. Note that our assumption includes the case of no eye movements at all, since it can be described as gazing at a point infinitely far away.

2.1 Optic Flow and the Subspace Algorithm. Optic flow is the projection of the motion of objects in the three-dimensional world onto a two-dimensional image plane. In three dimensions, every moving point has six degrees of freedom: the translational velocity T = (T_x, T_y, T_z)^t and the rotation \Omega = (\Omega_x, \Omega_y, \Omega_z)^t.
When an observer moves through a static environment all points in space share the same six motion parameters. The motion of a point R = (X, Y, Z)^t in a viewer-centered coordinate system is V = -(\Omega \times R + T). This motion is projected onto an image plane. Writing two-dimensional image vectors in small letters, the perspective projection of a point is r = (x, y)^t = f (X/Z, Y/Z)^t, where f denotes the focal length. Following Heeger and Jepson (1990) the image velocity can be written as the sum of a translational and a rotational component:

\theta(x, y) = p(x, y) A(x, y) T + B(x, y) \Omega \qquad (2.1)

where p(x, y) = 1/Z is the inverse depth, and

A(x, y) = \begin{pmatrix} -f & 0 & x \\ 0 & -f & y \end{pmatrix}, \qquad B(x, y) = \begin{pmatrix} xy/f & -(f + x^2/f) & y \\ f + y^2/f & -xy/f & -x \end{pmatrix}
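As a numerical illustration of this projection equation, the following sketch evaluates \theta(x, y) = p(x, y) A(x, y) T + B(x, y) \Omega. It is our own illustrative code, not part of the original model; the function names (`A`, `B`, `flow`) are ours, and the matrices follow the standard form from Heeger and Jepson (1990).

```python
import numpy as np

def A(x, y, f):
    # Translational part of the image-velocity equation
    return np.array([[-f, 0.0, x],
                     [0.0, -f, y]])

def B(x, y, f):
    # Rotational part of the image-velocity equation
    return np.array([[x * y / f, -(f + x * x / f), y],
                     [f + y * y / f, -x * y / f, -x]])

def flow(x, y, f, p, T, Omega):
    """Image velocity theta(x, y) = p(x, y) A(x, y) T + B(x, y) Omega."""
    return p * A(x, y, f) @ T + B(x, y, f) @ Omega

# Pure forward translation along the optical axis: a radial outflow
# pattern whose focus of expansion sits at the image center.
f = 1.0
T = np.array([0.0, 0.0, 1.0])
Omega = np.zeros(3)
theta = flow(0.2, 0.1, f, p=0.5, T=T, Omega=Omega)  # points away from the center
```

For this pure translation the flow at the image center itself is zero, reproducing Gibson's focus of expansion.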
The unknown depth and translational velocity are multiplied together and can thus only be determined up to a scale factor. Regarding therefore the translation T as a unit vector, one is left with six unknowns: p, the two remaining components of T, and the three components of \Omega, but only two known quantities, \theta_x and \theta_y. The subspace algorithm uses flow vectors at five distinct image points to yield an overdetermined system of equations that is solved with a minimization method in the following way: The five separate equations are combined into one matrix equation \Theta = C(T) q, where \Theta = (\theta_1, \ldots, \theta_5)^t is now a 10-dimensional vector consisting of the components of the five image velocities, q = [p(x_1, y_1), \ldots, p(x_5, y_5), \Omega_x, \Omega_y, \Omega_z]^t an eight-dimensional vector, and C(T) a 10 x 8 matrix composed of the A(x_i, y_i)T and B(x_i, y_i) matrices:

C(T) = \begin{pmatrix} A(x_1, y_1)T & & & B(x_1, y_1) \\ & \ddots & & \vdots \\ & & A(x_5, y_5)T & B(x_5, y_5) \end{pmatrix}
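The assembly of C(T) can be sketched as follows. This is our own illustrative code (the name `make_C` is ours): the first five columns hold the block-diagonal A(x_i, y_i)T vectors, one per unknown inverse depth, and the last three columns hold the stacked B(x_i, y_i) matrices.

```python
import numpy as np

def make_C(points, T, f=1.0):
    """10 x 8 subspace-algorithm matrix C(T) for five image points."""
    n = len(points)
    C = np.zeros((2 * n, n + 3))
    for i, (x, y) in enumerate(points):
        A = np.array([[-f, 0.0, x], [0.0, -f, y]])
        B = np.array([[x * y / f, -(f + x * x / f), y],
                      [f + y * y / f, -x * y / f, -x]])
        C[2 * i:2 * i + 2, i] = A @ T      # depth column for point i
        C[2 * i:2 * i + 2, n:] = B         # shared rotation columns
    return C

pts = [(-0.4, 0.3), (0.1, -0.2), (0.3, 0.3), (-0.2, -0.1), (0.0, 0.25)]
C = make_C(pts, np.array([0.1, 0.0, 1.0]))
print(C.shape)  # (10, 8)
```

One can verify the defining identity \Theta = C(T) q by generating the five flow vectors from arbitrary depths and a rotation and comparing against the matrix product.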
Heeger and Jepson (1990) then show that the heading direction can be recovered by minimizing the residual function
R(T) = \| \Theta^t C^\perp(T) \|^2

where C^\perp(T) is a matrix that spans the two-dimensional orthogonal complement of C(T).

2.2 Restriction to Fixations during Locomotion. We now restrict ourselves to only those eye movements that arise through the fixation of a point F = (0, 0, 1/p_F)^t in the center of the visual field while the observer moves along a straight line. The rotation that is necessary to fixate this point can be derived from the condition that the flow at this point has to be zero:

\begin{pmatrix} 0 \\ 0 \end{pmatrix} = p_F \begin{pmatrix} -f & 0 & 0 \\ 0 & -f & 0 \end{pmatrix} T + \begin{pmatrix} 0 & -f & 0 \\ f & 0 & 0 \end{pmatrix} \Omega

Choosing \Omega_z = 0 we find \Omega = p_F (T_y, -T_x, 0)^t. The optic flow then is:

\theta(x, y) = p(x, y) A(x, y) T + p_F B(x, y) (T_y, -T_x, 0)^t
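The fixation constraint can be checked numerically. The sketch below (our own code and naming, assuming the A and B matrices of the flow equation) computes the rotation \Omega = p_F (T_y, -T_x, 0)^t for a given heading and confirms that the resulting flow at the fixation point vanishes.

```python
import numpy as np

def fixation_rotation(T, p_F):
    """Eye rotation (with Omega_z = 0) that nulls the flow at a fixation
    point straight ahead at depth 1/p_F during translation T."""
    return p_F * np.array([T[1], -T[0], 0.0])

f = 1.0
T = np.array([-0.1, 0.05, 1.0])          # heading up and to the left
p_F = 1.0 / 20.0                          # fixation point at 20 m
Omega = fixation_rotation(T, p_F)

# Flow at the fixation point (x, y) = (0, 0): the translational and
# rotational components cancel exactly.
A0 = np.array([[-f, 0.0, 0.0], [0.0, -f, 0.0]])
B0 = np.array([[0.0, -f, 0.0], [f, 0.0, 0.0]])
flow_at_F = p_F * A0 @ T + B0 @ Omega
```

Setting p_F = 0 recovers the case of no eye movements, as noted in the text.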
The case of a straight translation without any eye movements can easily be described within this framework by considering a fixation point that is infinitely far away. Then p_F and the rotational velocity \Omega are zero, resulting in a purely translational flow. The optic flow equation above has only four unknowns: p(x, y), p_F, T_x, and T_y. Combining the equations for two different flow vectors into one matrix equation in the same way as before yields \Theta = C(T) [p(x_1, y_1), p(x_2, y_2), p_F]^t, where C(T) is now only a 4 x 3 matrix, the orthogonal complement of which is a line given by the vector C^\perp(T). The residual function becomes the scalar product between this vector and the observed flow:

R(T) = | \Theta^t C^\perp(T) |^2 \qquad (2.2)
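A minimal sketch of this residual computation, in our own code: we build the 4 x 3 matrix C(T) for two image points under the fixation constraint, obtain C^\perp(T) as the last left-singular vector of C(T), and evaluate R(T). One implementation choice here, the SVD, is ours; any method of computing the orthogonal complement would do.

```python
import numpy as np

def A_mat(x, y, f=1.0):
    return np.array([[-f, 0.0, x], [0.0, -f, y]])

def B_mat(x, y, f=1.0):
    return np.array([[x * y / f, -(f + x * x / f), y],
                     [f + y * y / f, -x * y / f, -x]])

def C_fix(points, T):
    """4 x 3 matrix C(T) for two points; unknowns are the two inverse
    depths and p_F (fixation constraint, Omega = p_F (T_y, -T_x, 0))."""
    rot = np.array([T[1], -T[0], 0.0])
    C = np.zeros((4, 3))
    for i, (x, y) in enumerate(points):
        C[2 * i:2 * i + 2, i] = A_mat(x, y) @ T
        C[2 * i:2 * i + 2, 2] = B_mat(x, y) @ rot
    return C

def residual(Theta, C):
    """R(T) = |Theta^t C_perp(T)|^2, with C_perp the 1D orthogonal
    complement of the column space of C."""
    U, _, _ = np.linalg.svd(C)
    c_perp = U[:, 3]
    return float((Theta @ c_perp) ** 2)

# Flow generated by a known heading plus fixation-induced rotation
pts = [(-0.3, 0.2), (0.25, -0.15)]
T_true = np.array([0.1, -0.05, 1.0])
p_F, depths = 0.05, [0.5, 0.3]
Omega = p_F * np.array([T_true[1], -T_true[0], 0.0])
Theta = np.concatenate([p * A_mat(x, y) @ T_true + B_mat(x, y) @ Omega
                        for (x, y), p in zip(pts, depths)])
r_true = residual(Theta, C_fix(pts, T_true))  # vanishes at the true heading
```

Consistent with the text, R(T) is zero not only at the true heading but along a whole line in the (T_x, T_y) plane, which is why several point pairs must be combined.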
Since the optic flow is a linear function of the translational direction, R(T) does not have a single minimum but is equal to zero along a line in the (T_x, T_y)-plane. Therefore one such minimization alone cannot give the translational velocity; rather, several pairs of flow vectors with different R(T) functions have to be used in conjunction.

2.3 The Network. In the first layer of the network, which constitutes the flow field input, 300 random locations within 50 degrees of eccentricity are represented. We assume a population encoding of the optic flow vectors at each location by small sets of neurons that share the same receptive field position but are tuned to different directions of motion. Each such group consists of n' neurons with preferred directions e_k, k = 1, \ldots, n'. The flow vector \theta is represented by the sum over the neuronal activities s_k in the following way:

\theta = \sum_{k=1}^{n'} s_k e_k \qquad (2.3)
We do not concern ourselves with how the optic flow is derived from the luminance changes in the retina or how the aperture problem is solved. Neural algorithms that deal with these questions have already been developed (Bulthoff et al. 1989; Hildreth 1984; Yuille and Grzywacz 1988). A physiologically plausible network model that yields as its output a population encoding like the one we use here has been proposed by Wang et al. (1989). It can be thought of as a preprocessing stage to our network, modeling the pathway from the retina to area MT or PMLS. Since we start out with a layer in which the optic flow is already present, we have to guarantee that the tuning curves of the neurons and the distributions of the preferred directions match the requirement of equation 2.3. As the simplest choice for our model, we use a rectified cosine function with n' = 4. It preserves the most prominent feature of the observed directional tuning curves in MT/PMLS, namely broad
unidirectional tuning with no response in the null direction. The preferred directions are equally spaced, e_k = (\cos(\pi k/2), \sin(\pi k/2))^t, and for the unit's response to a movement with speed \theta_0 and direction \phi the tuning curve is

s_k = \begin{cases} \theta_0 \cos(\phi - \pi k/2) & \text{if } \cos(\phi - \pi k/2) > 0 \\ 0 & \text{otherwise} \end{cases}
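For this particular choice of tuning, equation 2.3 recovers the flow vector exactly: only the two units whose preferred directions flank the stimulus direction are active, and their rectified-cosine activities are precisely the Cartesian components of the vector. A short sketch (our own code and naming):

```python
import numpy as np

def encode(theta0, phi, n=4):
    """Rectified-cosine population code with n' = 4 preferred directions."""
    ks = np.arange(n)
    s = theta0 * np.cos(phi - np.pi * ks / 2)
    return np.maximum(s, 0.0)          # no response in the null direction

def decode(s, n=4):
    """Equation 2.3: activity-weighted sum of the preferred directions
    e_k = (cos(pi k/2), sin(pi k/2))."""
    ks = np.arange(n)
    e = np.stack([np.cos(np.pi * ks / 2), np.sin(np.pi * ks / 2)], axis=1)
    return s @ e

s = encode(2.0, 0.7)                   # speed 2.0, direction 0.7 rad
v = decode(s)
# v recovers (2 cos 0.7, 2 sin 0.7) exactly for this choice of tuning
```

With broader or differently spaced tuning curves the decoded vector would only approximate the stimulus, which is why the text requires the tuning to match equation 2.3.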
The second layer represents a population encoding of the translational direction of the movement of the observer, which is represented by the intersection point of the 3D movement vector T with the image plane. There are populations of n neurons at possible intersection points whose combined activities u_l give the perceived direction. The sum of the activities u = \sum_{l=1}^{n} u_l at each position yields a measure of how likely this position is to be the correct direction of movement. The perceived direction is chosen to be the one that has the highest total activity. The output of a second layer neuron is a sigmoid function g(x) of the sum of the activities of its m input neurons, weighted by synaptic strengths J_{ikl} and compared to a threshold \mu:

u_l = g\left( \sum_{i=1}^{m} \sum_{k=1}^{n'} J_{ikl} s_{ik} - \mu \right) \qquad (2.4)
Here J_{ikl} denotes the strength of the connection between the lth output neuron and the kth input neuron in the population that represents image location i. The sigmoid function is symmetric such that g(-x) = 1 - g(x). The connections and their strengths are set once before the network is presented with any stimuli, and are fixed afterward. First a number of image locations are randomly assigned to a second layer neuron. Then, values for the synaptic strengths are calculated so that the population of neurons encoding a specific T is maximally excited when R(T) equals zero. Although a neuron may receive input from a large number of image locations, we start the calculation of the connections with only two in order to keep it simple. We want the sum in equation 2.4 to equal the scalar product on the right side of equation 2.2:

\sum_{i} \sum_{k} J_{ikl} s_{ik} = \Theta^t C^\perp(T)
For every single image location i we have

\sum_{k} J_{ikl} s_{ik} = \theta_i^t C_i^\perp(T)

where \theta_i is the flow vector at location i and C_i^\perp(T) denotes the corresponding two rows of C^\perp(T).
Substituting equation 2.3 we find

\sum_{k} J_{ikl} s_{ik} = \sum_{k} s_{ik} \, e_k^t C_i^\perp(T)
Neural Network for Processing Optic Flow
381
Therefore we set the synaptic strengths to

J_{ikl} = e_k^t C_i^\perp(T)
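This weight rule can be checked numerically: with J_k = e_k \cdot C_i^\perp and the rectified-cosine activities from Section 2.3, the weighted sum of the activities at one location equals \theta_i \cdot C_i^\perp, that location's contribution to \Theta^t C^\perp(T). The sketch below is our own illustrative code; the value of C_i^\perp used here is an arbitrary placeholder, not one computed from a real C(T).

```python
import numpy as np

n_dirs = 4
ks = np.arange(n_dirs)
e = np.stack([np.cos(np.pi * ks / 2), np.sin(np.pi * ks / 2)], axis=1)

def weights(c_perp_i):
    """J_k = e_k . C_i_perp(T) for one image location."""
    return e @ c_perp_i

def activities(theta0, phi):
    """Rectified-cosine tuning from Section 2.3."""
    return np.maximum(theta0 * np.cos(phi - np.pi * ks / 2), 0.0)

c = np.array([0.3, -0.2])    # hypothetical two rows of C_perp for this location
s = activities(1.5, 1.0)     # flow of speed 1.5, direction 1.0 rad
net = s @ weights(c)         # sum_k J_k s_k
theta = 1.5 * np.array([np.cos(1.0), np.sin(1.0)])
# net equals theta . c, the location's contribution to Theta^t C_perp
```

Note that opposite preferred directions receive opposite weights (J_1 = -J_3 and so on), which is what makes the matched-pair construction with inverted connections possible.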
If the neuron is connected to more than two image locations, the input connections are divided into pairs and the connections are calculated separately for each pair. Now the question of when R(T) is minimal comes down to the question of when all the neuron's inputs balance each other to give a net input of zero. Consider two output neurons u_l and u_l' receiving input from the same set of first layer neurons but with inverse connections such that J_{ikl'} = -J_{ikl}. Then, if the threshold \mu equals zero, the sum of both neurons' activities is equal to 1 regardless of their inputs, since the sigmoid input/output function is symmetric. If, however, \mu has a slightly negative value, both sigmoid functions will overlap and the sum will have a single peak at an input value of zero. Such a matched pair of neurons generates its maximal activity when R(T) = 0. MSTd neurons have very large receptive fields and certainly receive input from more than two image locations. Also, MSTd neurons show the same response with as few as 25 visible moving dots as they do with 300 (Duffy and Wurtz 1991a). We chose each of our model neurons to receive input from 30 image locations. We restrict the space for the encoded heading directions to the innermost 20 x 20 degrees of the visual field, since this approximates the range over which the psychophysical experiments have been carried out. Nevertheless, each layer-two neuron may receive input from a much larger part of the visual field. The layer-two neurons form a three-dimensional grid with 20 x 20 populations encoding one degree of translation-space each, and 20 pairs of neurons in each population.

3 Results

3.1 Comparison of the Network's Performance with Human Psychophysical Data. The network was tested with simulated flow fields with different motion parameters. We used a cloud-like pattern that consisted of a number of dots, the depths of which were randomly distributed within a given range.
To test the behavior without eye movements a translational direction was randomly chosen within the innermost 20 x 20” and the rotation was set to zero. To test cases with eye rotation the translational direction was again chosen randomly and the fixation point was set in the center of the image plane and assigned a specific depth. The rotational component was then calculated from the condition that the flow at the fixation point must be zero. Each simulation run consisted of 100 presentations of different flow fields, after which we calculated the mean error as the mean angular difference between the network’s computed direction and the correct direction.
Figure 1: Performance with sparse flow fields. The heading error becomes small with as few as 10 vectors. The number of dots necessary is about the same with or without eye movements.
We found the network's performance to be well within the range of human performers (Warren et al. 1988). For pure translation as well as with eye movements the mean error settled between 0.5 and 1.5 degrees, showing that the network always has its activity maximum at a position close to the translational direction. Consistent with the experiments of Warren et al. (1988) we found very little influence of speed on the performance of the network. Humans are able to detect their heading with very sparse flow fields consisting of only ten dots (Warren and Hannon 1988, 1990). In order to test how many flow vectors are needed in our model under otherwise optimal conditions we made an additional assumption: We assumed that a given pair of vectors in the flow field serves as input to at least one pair of neurons in each population of the output layer. If this were not the case, some populations would receive more information than others and the number of dots necessary for correct heading estimation would depend on the heading direction. Our assumption ensures that all heading directions are represented equally. Considering the large number of cortical neurons this assumption is biologically reasonable, since it would be approximately fulfilled if the number of neurons in the output layer were large. For the simulations, we distributed the connections between input and output neurons in such a way as to fulfill the assumption. The results of the simulations are shown in Figure 1. The cloud of dots extended in depth from 11 to 31 m with a fixation point at 21 m. The translational
speed was 2 m/sec. In both the pure translation and the eye rotation case the network started to detect the heading with the desired accuracy at approximately 10 points, although with eye rotation the error did not quite reach the optimum and continued to decrease as more flow vectors were provided. Mathematically, two vectors are sufficient to compute the heading of a purely translational movement (Prazdny 1980), but humans fail to detect their heading with only two visible dots (Warren et al. 1988). Our network does not know a priori if the flow field is generated by a translation alone. It therefore has to rely on the flow pattern and needs about the same number of vectors as with eye movements. Humans also fail when eye rotations are paired with a perpendicular approach to a solid wall, where all points are at the same depth (Rieger and Toet 1985; Warren and Hannon 1990). In this case the subjects' performances are at chance and they often report themselves as heading toward the fixation point. Because of a well-known ambiguity in planar flow fields (Tsai and Huang 1984), we were not able to test the depth dependence of the network with approaches to a plane at different angles. We therefore varied the depth range of the cloud. Doing this revealed that with decreasing depth the peak in the second layer grows broader and covers the fixation point as well as the heading direction. This can be seen in Figure 2, where the summed population activities in the output layer are shown on a grayscale map, together with reduced pictures of the input flow fields. Input and output are compared for situations that differ in the amount of depth in the image. In Figure 2a a flow field is shown in which the depth range of the cloud of dots is large, extending from 7 to 30 m. The observer moves toward the cross while he is keeping his eyes fixed on an object (x) in the center. There is no apparent focus of expansion. The network output (Fig.
2b) shows an easily localizable brightness peak in the upper left that corresponds to the correct heading direction as indicated by the cross. Figure 2c shows the same movement as Figure 2a, but here the depth range of the cloud is much smaller, ranging from 19 to 21 m. In this case the flow field looks very much like an expansion centered at the fixation point. In the corresponding network output (Fig. 2d), the peak is very broad and includes the fixation point in the center. A maximum nevertheless still exists, although much less pronounced, and in the simulations the network was still able to compute the right heading. However, the solution is unstable and very sensitive to noise. To illustrate this, we randomly varied the amplitudes of the flow vectors by stretching them by a factor distributed uniformly between 0.9 and 1.1, thus adding 10% noise. The results for all conditions are shown in Figure 3 for different depth ranges. This small amount of noise increases the error for the rotational movement to around 7 degrees, whereas in the purely translational case the network performance is unaffected. With growing depth differences this separation becomes less pronounced and the error values for the rotational case decrease.
Figure 2: Influence of image depth on the heading judgment of the network. (a) Depth-rich flow field. Movement is toward the cross (+) while the x in the center is fixated. (b) Output of the network. The response peak gives the correct heading. (c) Same movement with only little depth differences. (d) Brightness maximum in the output of the network is very broad and includes the fixation point.

3.2 Comparison with Single Cell Properties in MSTd. The output layer cells of our model network exhibit a remarkable resemblance to some triple component neurons in MSTd. Figure 4 shows the response of one output layer cell to presentations of each of the components (e.g., expansions, rotations) at different places in the visual field. The neuron
[Figure 3 legend: eyes fixed on target, no noise; no eye movements, no noise; eyes fixed on target, 10% noise]
Figure 3: Heading error versus depth. In the noise-free condition, heading calculation is accurate despite the broad peak in the network output depicted in Figure 2d. Adding a small amount of noise, however, shows that the solution in the eye movement case is unstable and gives rise to a large error.

receives input from 30 positions distributed inside a 60 x 60 degree receptive field centered in the lower right quadrant of the visual field and extending up to 10 degrees into each of the neighboring quadrants, thus including the vertical and horizontal meridians and the fovea or area centralis (Fig. 4a). This receptive field characteristic is common for MSTd neurons (Duffy and Wurtz 1991b). The neuron in our example is a member of the population that represents a heading direction in the upper right quadrant at an eccentricity of 11 degrees. Figure 4b shows the cell's broad unidirectional tuning and little selectivity for stimulus speed. The plots c-f in Figure 4 illustrate the responses of the neuron to expansions, contractions, clockwise rotations, and counterclockwise rotations, respectively. The (x, y)-plane represents a visual field of 100 x 100 degrees; the height is the response of the neuron to a stimulus centered at (x, y). The size of the stimulus was always large enough to cover the whole receptive field of the cell. For a stimulus in the center of the visual field the cell responds favorably to counterclockwise rotations and expansions, although there also is a smaller response to contractions. There are very large areas of position invariance covering almost half of the visual field for a given stimulus movement. The response to counterclockwise rotations, for instance, is constant in most of the upper two quadrants. The cell also shows the reversals in selectivity observed in 40% of triple-component neurons in MSTd (Duffy and Wurtz 1991b). In our example, moving the center of the stimuli to the right causes the response
Figure 4: Responses of one output layer cell. (a) Receptive field of the cell as defined by its input connections. (b) Broad unidirectional response to global shifts of a dot pattern. No tuning to a particular stimulus speed. (c-f) Responses to expanding, contracting, and rotating patterns centered at different positions within the visual field reveal large areas of position invariance and sudden reversals of selectivity.
to contractions to disappear. Moving the center of the stimuli to the lower left causes the cell's selectivity to change to favor contractions and clockwise rotations. There are intermediate positions where the cell responds to both modes of one component. For example, in plots b and c, there is a vertical strip in the center where the cell responds to expansions as well as to contractions. The response reversals take place along edges running across the visual field, which is similar to the findings of Duffy and Wurtz (1991b). The reason for this is that the residual function, which is computed by the neuron, equals zero along a line in the (T_x, T_y) space, as mentioned before. The edge of the surface that marks the neuron's response to expansions follows this line. The neuron signals only that the heading direction lies somewhere along the edge. The edges of all neurons in one population overlap at the point that corresponds to the heading represented by that population. When the network is presented with a flow field, the population encoding the correct heading is maximally excited since all of its neurons will respond. In populations representing other directions, only part of the neurons will be active, so that the total activity will be smaller. It is worth noting that the edges of reversal do not necessarily cross the receptive field of the cell. In the example of Figure 4, the reversal from selectivity for expansion to selectivity for contraction takes place in the left half of the visual field outside the cell's receptive field, which occupies the lower right quadrant. Likewise, it sometimes occurred in the simulations that the reversal for rotation was not even contained within the 100 x 100 degree visual field. Another interesting observation is that the edges for rotation and expansion/contraction often cross each other approximately orthogonally. The position of the intersection point, on the other hand, can vary widely between cells.
4 Discussion
We have designed a neural network that detects the direction of ego-motion from optic flow and is consistent with recent neurophysiological and psychophysical data. It solves the traditional problem of eye movements distorting the radial flow field by means of a biologically reasonable mechanism. The model reproduces some key properties of human ego-motion perception, namely, the ability to function consistently over a range of speeds, the ability to work with sparse flow fields, and the difficulties in judging the heading when approaching a wall while moving the eyes. The network also generates interesting neuronal properties in its output layer. Simple intuitive models for heading perception might expect a
single neuron to show a peak of activity for an expansion at a certain preferred heading direction. Instead, our model uses a population encoding in which single cells do not carry all the information about the perceived heading; rather, the combined activity of a number of cells gives that information. At the level of a single neuron, the position information is contained in the edges of reversal of the cell's preferred direction of stimulus motion. The resulting characteristics of the output neurons in our network show great similarity to the response properties of a particular cell class recently described in MSTd, the triple-component neurons (Duffy and Wurtz 1991a,b). These cells, which comprise about one-third of all neurons in MSTd, display selectivity not only for expansion or contraction, but also for one type of rotation and one direction of shifting patterns. Most of the neuronal outputs produced by our network have similar properties. It appears tempting to postulate, therefore, that the output of triple-component cells could be used to compute directional heading, either within MST or in another area. A potential problem for using the output of MSTd neurons to compute heading direction concerns their apparent position invariance. In a neural network that is supposed to signal the directional heading, the response of the output layer cells has to depend on the position of the stimulus in some way. Most neurons in MSTd seem to be insensitive to changes of stimulus position, although the proportions of position invariant cells reported in different studies vary and obviously depend on the exact stimulus paradigm (Andersen et al. 1990; Duffy and Wurtz 1991b; Orban et al. 1992). In our network model many output neurons would appear position invariant when tested over a wide but limited range of stimulus positions.
Interestingly, the proportion of position dependent responses seems to be highest among triple-component neurons (Duffy and Wurtz 1991b): In about 40% of these cells component selectivity for a flow field stimulus is reversed along oriented edges, which conforms exactly with the behavior of our model neurons. It is conceivable, therefore, that it is this subtype of triple-component neurons that is involved in the computation of heading direction. More neurons of this type might be encountered in MSTd if one specifically looks for them. Their frequency of occurrence may depend on laminar position, or they might be found even more frequently at another processing stage. A closer look at the experimental data reveals that the number of triple component cells in MSTd may indeed have been underestimated. The different cell types in MSTd do not fall in strictly separate classes but rather form a continuum changing smoothly from triple to single component cells (Duffy and Wurtz 1991a). Therefore, double and single component cells might be regarded as possessing some, albeit weak, responses to the other components. It is equally possible, however, that single and double component cells simply do not participate in the detection of heading direction, but serve some other purpose. Single component cells, for example, could be involved in the analysis of object motion. The network can also generate cells that are selective to fewer components when the restriction is removed that rotations are due to the fixation of an object. Allowing arbitrary rotations, including ones around a sagittal axis through the eye, results in neurons that are unselective for rotations and respond only to translations and expansions/contractions. Under the different assumption that only frontoparallel rotations, including for instance pursuit eye movements, will occur, the neurons show strong, fully position invariant responses to rotational stimuli, which dominate over the selectivity for translation and expansion/contraction (Lappe and Rauschecker 1993). We would like to emphasize that the neurons in our model do not decompose the flow field directly. At no point is the translational part of the optic flow actually computed. The neurons rather test the consistency of a measured optic flow with a certain heading direction. In this way, a response selectivity for rotations, for example, does not mean that the neuron is actually tuned to the detection of a rotation in the visual field; this property rather has to be regarded as the result of a more complex selectivity. The cells in the output layer of our model form a computational map of all possible heading directions. However, it would not be easy to find this map in an area of visual cortex, since the topography reveals itself only in the properties of cell populations. Simultaneous recording from an array of electrodes would perhaps be the only way to demonstrate this computational map experimentally. Our model suggests that one has to focus on the mapping of selectivity reversals and explore these more thoroughly, especially in triple component cells: Neurons in neighboring columns should show smooth shifts of their preferences.
The concurrent activity of such cells in a hypercolumn would signal one particular heading direction in space, which is given by the intersection point of their reversal edges for expansion and contraction.
References

Albright, T. D. 1989. Centrifugal directionality bias in the middle temporal visual area (MT) of the macaque. Visual Neurosci. 2, 177-188.
Allman, J. M., and Kaas, J. H. 1971. A representation of the visual field in the caudal third of the middle temporal gyrus of the owl monkey (Aotus trivirgatus). Brain Res. 31, 85-105.
Andersen, R., Graziano, M., and Snowden, R. 1990. Translational invariance and attentional modulation of MST cells. Soc. Neurosci. Abstr. 16, 7.
Brenner, E., and Rauschecker, J. P. 1990. Centrifugal motion bias in the cat's lateral suprasylvian visual cortex is independent of early flow field exposure. J. Physiol. 423, 641-660.
Bruss, A. R., and Horn, B. K. P. 1983. Passive navigation. Computer Vision, Graphics, Image Process. 21, 3-20.
Bülthoff, H., Little, J., and Poggio, T. 1989. A parallel algorithm for real-time computation of optical flow. Nature (London) 337, 549-553.
Clare, M. H., and Bishop, G. H. 1954. Responses from an association area secondarily activated from optic cortex. J. Neurophysiol. 17, 271-277.
Duffy, C. J., and Wurtz, R. H. 1991a. Sensitivity of MST neurons to optic flow stimuli. I. A continuum of response selectivity to large-field stimuli. J. Neurophysiol. 65(6), 1329-1345.
Duffy, C. J., and Wurtz, R. H. 1991b. Sensitivity of MST neurons to optic flow stimuli. II. Mechanisms of response selectivity revealed by small-field stimuli. J. Neurophysiol. 65(6), 1346-1359.
Gibson, J. J. 1950. The Perception of the Visual World. Houghton Mifflin, Boston.
Hatsopoulos, N. G., and Warren, W. H., Jr. 1991. Visual navigation with a neural network. Neural Networks 4(3), 303-318.
Heeger, D. J., and Jepson, A. 1990. Visual perception of three-dimensional motion. Neural Comp. 2, 129-137.
Hildreth, E. C. 1984. The Measurement of Visual Motion. MIT, Cambridge, MA.
Koenderink, J. J., and van Doorn, A. J. 1981. Exterospecific component of the motion parallax field. J. Opt. Soc. Am. 71(8), 953-957.
Lappe, M., and Rauschecker, J. P. 1991. A neural network for flow-field processing in the visual motion pathway of higher mammals. Soc. Neurosci. Abstr. 17, 441.
Lappe, M., and Rauschecker, J. P. 1993. Computation of heading direction from optic flow in visual cortex. In Advances in Neural Information Processing Systems, Vol. 5, C. L. Giles, S. J. Hanson, and J. D. Cowan, eds. (in press). Morgan Kaufmann, San Mateo, CA.
Livingstone, M., and Hubel, D. 1988. Segregation of form, color, movement, and depth: Anatomy, physiology, and perception. Science 240, 740-749.
Longuet-Higgins, H. C., and Prazdny, K. 1980. The interpretation of a moving retinal image. Proc. R. Soc. London B 208, 385-397.
Mishkin, M., Ungerleider, L. G., and Macko, K. A. 1983. Object vision and spatial vision: Two cortical pathways. Trends Neurosci. 6, 414-417.
Orban, G. A., Lagae, L., Verri, A., Raiguel, S., Xiao, D., Maes, H., and Torre, V. 1992. First-order analysis of optical flow in monkey brain. Proc. Natl. Acad. Sci. U.S.A. 89, 2595-2599.
Palmer, L. A., Rosenquist, A. C., and Tusa, R. J. 1978. The retinotopic organization of lateral suprasylvian visual areas in the cat. J. Comp. Neurol. 177, 237-256.
Prazdny, K. 1980. Egomotion and relative depth map from optical flow. Biol. Cybern. 36, 87-102.
Rauschecker, J. P., von Grünau, M. W., and Poulin, C. 1987a. Centrifugal organization of direction preferences in the cat's lateral suprasylvian visual cortex and its relation to flow field processing. J. Neurosci. 7(4), 943-958.
Rauschecker, J. P., von Grünau, M. W., and Poulin, C. 1987b. Thalamocortical connections and their correlation with receptive field properties in the cat's lateral suprasylvian visual cortex. Exp. Brain Res. 67, 100-112.
Regan, D., and Beverley, K. I. 1982. How do we avoid confounding the direction we are looking and the direction we are moving? Science 215, 194-196.
Rieger, J. H., and Lawton, D. T. 1985. Processing differential image motion. J. Opt. Soc. Am. A 2, 354-360.
Rieger, J. H., and Toet, L. 1985. Human visual navigation in the presence of 3-D rotations. Biol. Cybern. 52, 377-381.
Stone, L. S., and Perrone, J. A. 1991. Human heading perception during combined translational and rotational self-motion. Soc. Neurosci. Abstr. 17, 857.
Tanaka, K., and Saito, H.-A. 1989a. Analysis of motion of the visual field by direction, expansion/contraction, and rotation cells clustered in the dorsal part of the medial superior temporal area of the macaque monkey. J. Neurophysiol. 62(3), 626-641.
Tanaka, K., and Saito, H.-A. 1989b. Underlying mechanisms of the response specificity of expansion/contraction and rotation cells in the dorsal part of the medial superior temporal area of the macaque monkey. J. Neurophysiol. 62(3), 642-656.
Toyama, K., Fujii, K., and Umetani, K. 1990. Functional differentiation between the anterior and posterior Clare-Bishop cortex of the cat. Exp. Brain Res. 81, 221-233.
Tsai, R. Y., and Huang, T. S. 1984. Uniqueness and estimation of three-dimensional motion parameters of rigid objects with curved surfaces. IEEE Trans. Pattern Anal. Machine Intelligence 6, 13-27.
Ungerleider, L. G., and Mishkin, M. 1982. Two cortical visual systems. In Analysis of Visual Behavior, D. J. Ingle, M. A. Goodale, and R. J. W. Mansfield, eds., pp. 549-586. MIT Press, Cambridge, MA.
Wang, H. T., Mathur, B. P., and Koch, C. 1989. Computing optical flow in the primate visual system. Neural Comp. 1(1), 92-103.
Warren, W. H., Jr., and Hannon, D. J. 1988. Direction of self-motion is perceived from optical flow. Nature (London) 336, 162-163.
Warren, W. H., Jr., and Hannon, D. J. 1990. Eye movements and optical flow. J. Opt. Soc. Am. A 7(1), 160-169.
Warren, W. H., Jr., Morris, M. W., and Kalish, M. 1988. Perception of translational heading from optical flow. J. Exp. Psychol.: Human Percept. Perform. 14(4), 646-660.
Waxman, A. M., and Ullman, S. 1985. Surface structure and three-dimensional motion from image flow: A kinematic analysis. Int. J. Robotics Res. 4, 72-94.
Yuille, A. L., and Grzywacz, N. M. 1988. A computational theory for the perception of coherent visual motion. Nature (London) 335, 71-74.
Zeki, S., and Shipp, S. 1988. The functional logic of cortical connections. Nature (London) 335, 311-317.

Received 7 January 1992; accepted 2 October 1992.
Communicated by Richard Durbin and Graeme Mitchison
Arbitrary Elastic Topologies and Ocular Dominance

Peter Dayan
Computational Neurobiology Laboratory, The Salk Institute, P.O. Box 85800, San Diego, CA 92186-5800 USA

The elastic net, which has been used to produce accounts of the formation of topology-preserving maps and ocular dominance (OD) stripes, embodies a nearest neighbor topology. A Hebbian account of OD is not so restricted, and indeed makes the prediction that the width of the stripes depends on the nature of the (more general) neighborhood relations. Elastic and Hebbian accounts have recently been unified, raising a question mark about their different determiners of stripe widths. This paper considers this issue, and demonstrates theoretically that it is possible to use more general topologies in the elastic net, including those effectively adopted in the Hebbian model.

1 Introduction
Durbin and Willshaw's (1987) elastic net algorithm for solving the traveling salesperson problem (TSP) is based on a method for developing topology-preserving maps between the eye and brain (or lateral geniculate nucleus and cortex) due to von der Malsburg and Willshaw (1977) and Willshaw and von der Malsburg (1979). The elastic algorithm inspired a host of similar ones aimed at different optimization tasks, one of which is this topology problem, augmented by two associates: forming ocular dominance stripes and orientation selective cells (Goodhill and Willshaw, GW, 1990; Durbin and Mitchison 1990). Simić (1990, 1991) and Yuille (1990) looked at the relationship between elastic algorithms and Hebbian inspired ones (Hopfield and Tank 1985), showing that both mechanisms could be viewed as optimizing the same functions, albeit implementing the constraints differently (for the TSP, that each city should be visited exactly once). More recently, Yuille, Kolodny, and Lee (YKL 1991) repeated the feat and aligned elastic and Hebbian (Miller, Keller, and Stryker, MKS 1989) accounts of ocular dominance.

The elastic net for the TSP consists of a set of points on a computational rubber band, pulled by forces toward the cities that have to be visited and by tension. The energy in a stretched rubber band is proportional to the square of its extension, which is incorrect for modeling the length of a tour (proportional just to the extension, in this model), but Durbin (cited as a personal communication in Yuille 1990) suggests that changing the elastic net to use the absolute distance rather than its square is infelicitous. Hopfield and Tank's (1985) model does in fact use the actual distances, and so, as they lucidly discuss, Simić's and Yuille's match between the elastic and Hebbian algorithms is not perfect.

The nature of the topologies is even more mysterious in the match between Hebbian and elastic algorithms for ocular dominance. Topology enters MKS' model through a cortical interaction function, which involves more than just the nearest neighbors. Conversely, these are the natural topology for the elastic version. This is one factor leading to an apparent difference between the predictions of MKS and GW. MKS suggested that the width of ocular dominance stripes is governed by the width of the cortical interaction function, whereas GW predicted that it is dependent on the relative correlations within and between the two eyes. This paper considers the issue by examining the two models of ocular dominance. The next section reviews YKL's analysis, and Section 3 looks at generalizing the nearest neighbor topology, testing the generalization in a one-dimensional version of ocular dominance.

Neural Computation 5, 392-401 (1993) © 1993 Massachusetts Institute of Technology

2 Yuille, Kolodny, and Lee's Analysis
YKL unify the two models through the medium of a single cost function, which defines a generalized deformable model:¹

E[V^L, V^R, Y] = Σ_{ai} [ V^L_{ai} |x^L_i − y_a|² + V^R_{ai} |x^R_i − y_a|² ] + ν Σ_a |y_a − y_{a+1}|²   (2.1)

where V^L_{ai} and V^R_{ai} are the variables matching the ith unit in the left and right eyes (more correctly lateral geniculate nucleus layers), respectively, to the ath unit in the cortex, x^L_i and x^R_i are the retinal "positions" of the ith unit in the left and right eyes, y_a is the "position" of the ath unit in the cortex, and Y ≡ {y_a}. As GW and YKL say, these "positions" are defined somewhat abstractly; however, they are intended to capture something like the correlations between the firings of the retinal and cortical units. ν is a constant that governs the relative weighting of the first term, which measures how close, correlationally, matching cells are, and the second term, which measures how close neighboring cortical cells are. This cost function owes its power of unification to having both matching V and continuous Y variables. The constraint on both retinal and cortical fields on a solution, that each cell should have a unique partner, is effectively duplicated in these

¹For convenience, this paper will look at the one-dimensional versions of the various tasks. Extensions to the second dimension are straightforward, but messy. Also, YKL separate out the retinotopy dimension, whereas it is incorporated here into the continuous variables x and Y. MKS arbor functions are also neglected.
two sets of variables.² Minimizing E subject to these constraints leads to the optimal map. Hebbian and elastic methods are effectively different ways of minimizing this function, imposing different constraints in different manners on the way to deriving a solution. Both use Hopfield and Tank's key insight for the TSP that the constraints need not all hold throughout the optimization process, so long as they are guaranteed to be satisfied by the time the algorithm terminates.

The reduction to an elastic net comes from eliminating the V^L and V^R variables using a Gibbs trick. The probability of a particular assignment of V and Y is declared to be proportional to e^{−βE[V^L,V^R,Y]}, and these terms are summed over the set of V^R and V^L that satisfy the partial constraint that each cell in the cortex maps to a unit in either the left or the right eye, but not both. The resulting elastic energy function is³

E[Y] = −(1/β) Σ_a log Σ_i [ e^{−β|x^L_i − y_a|²} + e^{−β|x^R_i − y_a|²} ] + ν Σ_a |y_a − y_{a+1}|²   (2.2)
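To make the reduction concrete, here is a small numpy sketch of an elastic energy of this Gibbs-reduced form: a matching term −(1/β) Σ_a log Σ_i (e^{−β|x^L_i − y_a|²} + e^{−β|x^R_i − y_a|²}) plus the toroidal tension term ν Σ_a |y_a − y_{a+1}|². The exact grouping of the sums, the sizes, and the parameter values here are our own illustrative assumptions, not taken from YKL.

```python
import numpy as np

def elastic_energy(y, xL, xR, beta=4.0, nu=0.5):
    """Gibbs-reduced elastic energy: each cortical point y_a 'chooses' a
    retinal cell in one eye via the log-sum, plus nearest-neighbor tension
    with toroidal boundary conditions (one-dimensional 'positions')."""
    y = np.asarray(y, dtype=float)
    dL = (xL[None, :] - y[:, None]) ** 2          # |xL_i - y_a|^2, shape (A, I)
    dR = (xR[None, :] - y[:, None]) ** 2          # |xR_i - y_a|^2
    gibbs = np.logaddexp(-beta * dL, -beta * dR)  # log of per-(a, i) Gibbs sum
    match = -(1.0 / beta) * np.log(np.exp(gibbs).sum(axis=1)).sum()
    tension = nu * ((y - np.roll(y, -1)) ** 2).sum()  # toroidal |y_a - y_{a+1}|^2
    return match + tension

# Illustrative one-dimensional instance: two slightly offset retinas.
xL = np.linspace(0.0, 1.0, 8)     # "positions" of left-eye cells
xR = xL + 0.05                    # right eye slightly offset
y = np.linspace(0.0, 1.0, 16)     # cortical points
print(elastic_energy(y, xL, xR))
```

Increasing ν penalizes cortical stretching more heavily, which is the knob the tension term provides.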
Note that the topology term survives this reduction intact, since it does not depend on the V. The alternative to eliminating the V variables is to eliminate the Y. YKL do this by regarding E[V^L, V^R, Y] as a quadratic form in Y, which has a minimum at Y_min[V^L, V^R]. Imposing the normalization constraint (see MKS) that each cortical cell receives a constant weight from the retina,

Σ_i ( V^L_{ai} + V^R_{ai} ) = 1  for each a   (2.3)

gives

( I + ν𝒯 ) Y^T = V^L X^{L,T} + V^R X^{R,T}

where X^L = {x^L_i} and X^R = {x^R_i}, and the matrix 𝒯, with 𝒯_{ab} = 2δ_{ab} − δ_{a,b+1} − δ_{a,b−1}, embodies the toroidal nearest neighbor topology. Therefore, at the minimum

Y^T = ( I + ν𝒯 )^{−1} ( V^L X^{L,T} + V^R X^{R,T} )

where the inverse exists for ν > 0. Substituting back into equation 2.1, imposing the constraints in equation 2.3, and ignoring terms that do not depend on the V, gives

E[V^L, V^R] = Σ_{ab} ( I + ν𝒯 )^{−1}_{ab} { Σ_{ij} V^L_{ai} V^L_{bj} |x^L_i − x^L_j|² + Σ_{ij} V^R_{ai} V^R_{bj} |x^R_i − x^R_j|² + 2 Σ_{ij} V^L_{ai} V^R_{bj} |x^L_i − x^R_j|² }   (2.4)

²In terms of these variables: for each a, one of the collection over i of {V^L_{ai}, V^R_{ai}} should be 1 and all the rest 0, and for each i the same should be true of the collection over a. Also, for each a, y_a should be the same as one x^L_i or x^R_i, and for each i, there should be different a₁ and a₂ such that y_{a₁} = x^L_i and y_{a₂} = x^R_i.
³Here and throughout, boundary conditions are avoided by assuming toroids and using modulo arithmetic.
MKS' Hebbian system regards the output o_a of cortical cell a as coming partly from the input from the two eyes i^L and i^R through the connection matrix, [U^L i^L + U^R i^R]_a, and partly from the other cortical cells, [D o]_a (MKS call C = (I − D)^{−1} the cortical interaction function):

o = U^L i^L + U^R i^R + D o
  = ( I − D )^{−1} ( U^L i^L + U^R i^R )   (2.5)

Hebbian learning changes U^L_{ai} in proportion to ⟨o_a i^L_i⟩, where the angle brackets represent an averaging process. Defining D^{LL}_{jk} = ⟨i^L_j i^L_k⟩, D^{RR}_{jk} = ⟨i^R_j i^R_k⟩, and D^{LR}_{jk} = ⟨i^L_j i^R_k⟩, YKL show that the U are moving down the gradient of

E[U^L, U^R] = −Σ_{ab} ( I − D )^{−1}_{ab} { Σ_{ij} U^L_{ai} U^L_{bj} D^{LL}_{ij} + Σ_{ij} U^R_{ai} U^R_{bj} D^{RR}_{ij} + 2 Σ_{ij} U^L_{ai} U^R_{bj} D^{LR}_{ij} }   (2.6)

Compare equations 2.4 and 2.6. YKL argue that for the intent of comparing minima, one can identify K − |x^L_i − x^L_j|² with D^{LL}_{ij}, and similarly for D^{RR}_{ij} and D^{LR}_{ij}, for some constant K. Therefore, if −2D = ν𝒯, these two expressions will have the same interesting minima; so, provided that the constraints are properly satisfied during learning, they should lead to the same ultimate solution.⁴ Note that this can make the effective correlations negative at some distance, which, as MKS discuss, allows correlation width to determine stripe width in their model. The cortical interaction function calculated from 𝒯 (using ν = 3/4) is shown in Figure 1.

⁴YKL actually derive a different condition for matching: that ζ𝒯 = (I − D)^{−1} for some constant ζ. The truth of this would appear to depend on Σ_{aij} V^L_{ai} V^L_{aj} |x^L_i − x^L_j|², and the similar expressions for V^R V^R and V^R V^L, being constant over the V that satisfy the partial constraints.

Figure 1: Cortical interaction functions. (a) Elastic topologies. (b) MKS topologies.

Although YKL show that this is enough to produce interesting ocular dominance stripes, it is clearly not the same as the MKS cortical interaction function, which is shown in the same figure. One reason why elastic and Hebbian models make different predictions about the factors determining stripe widths is also obvious: the cortical
interaction function corresponding to the elastic topology is immutable. The next section considers alternatives.

3 Generalizing Elastic Topologies
The shape of the interaction function comes from the term Σ_a |y_a − y_{a+1}|² in equation 2.1. A more general quadratic form for this is

ν Σ_{ab} S_{ab} y_a y_b   (3.1)

For instance, if S = 𝒯, then this reduces to exactly the same expression as in equation 2.1. Note also that this formulation is sufficiently general as to model the case of two-dimensional retinas and cortex, although it does not extend to nonquadratic cases such as the length rather than the square of the length for the TSP. Such a change has little effect on the elastic energy function from equation 2.2, whose topology term simply becomes ν Σ_{ab} S_{ab} y_a y_b. However, differentiating E as a quadratic form to eliminate the Y leads to

( I + νS ) Y^T = V^L X^{L,T} + V^R X^{R,T}

assuming that S is symmetric. If S also has a similarity property⁵ such that Σ_b (I + νS)^{−1}_{ab} does not depend on a, then substituting back in gives the energy function of equation 2.4 with (I + νS)^{−1} in place of (I + ν𝒯)^{−1}. As above, setting −2D = νS to unify the elastic and Hebbian energy functions allows Hebbian modeling of arbitrary elastic topologies and vice versa.

One way of generating elastic topologies is to consider them in terms of an estimation problem. Say that Σ_b ℰ_{ab} y_b is an estimate of y_a. Then, the total square estimate error is

Σ_a | y_a − Σ_b ℰ_{ab} y_b |² = Σ_{ab} [ ( I − ℰ )^T ( I − ℰ ) ]_{ab} y_a y_b

⁵This holds if the topology is the same over the whole cortex.

Comparing this with equation 3.1 shows that S = (I − ℰ)^T (I − ℰ). 𝒯 can be generated this way, by making ℰ_{a,a+1} = 1 and the remaining components 0. Another example comes from estimating y_a as the average of both its neighbors, that is, setting

ℰ_{ab} = { 1/2 if b = a+1 or b = a−1; 0 otherwise }

which gives

S_{ab} = (1/4) { 6 if b = a; −4 if b = a+1 or b = a−1; 1 if b = a+2 or b = a−2; 0 otherwise }   (3.2)
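The estimation construction can be checked in a few lines of numpy; the ring size and the value of ν below are illustrative choices.

```python
import numpy as np

N = 12                      # cortical cells on a ring (toroidal boundary)
idx = np.arange(N)

# E estimates y_a as the average of its two neighbors.
E = np.zeros((N, N))
E[idx, (idx + 1) % N] = 0.5
E[idx, (idx - 1) % N] = 0.5

# S = (I - E)^T (I - E), the generalized topology matrix.
S = (np.eye(N) - E).T @ (np.eye(N) - E)

# Check S against the closed form of equation 3.2: (1/4){6, -4, 1, 0}.
assert np.isclose(S[0, 0], 6 / 4)
assert np.isclose(S[0, 1], -4 / 4)
assert np.isclose(S[0, 2], 1 / 4)
assert np.isclose(S[0, 3], 0.0)

# The corresponding cortical interaction function C = (I + nu*S)^(-1).
# By circulant symmetry every row of C is a shifted copy of the first;
# its profile is the more Mexican-hat-like curve of Figure 1.
nu = 0.75
C = np.linalg.inv(np.eye(N) + nu * S)
```

Since ℰ is circulant, so are S and C, which is what makes the interaction function a single profile over cortical distance.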
whose associated cortical interaction function more closely resembles the MKS Mexican hats (see Fig. 1). Note that although for any ℰ there is a unique associated S and therefore C, the same is not true the other way around. Symmetric S will only have a square root if all its eigenvalues are positive, and it is easy to generate seemingly plausible C for which this is not the case. MKS generated their C as (3.3)
where K = 1/7.5 and D = 7 was the width of their arbor function (the number of cortical cells to which a retinal cell would connect). Changing D changes the length scale of the cortical interaction, and so changes the optimal stripe width. Figure 1 shows graphs of the elastic net cortical interaction function, the Mexican hat one from equation 3.2, and two generated from equation 3.3, one with D = 7 and one with D = 14.

One way to test the generalized topology terms is to use them in the cost function of equation 2.1 and to consider the optimal stripe width for the ocular dominance maps this defines.⁶ It is convenient to study the one-dimensional case, since the interesting optima are just "Z" folds, as in the left-hand side of Figure 2 (after GW). Maps inspired by the sideways "U" shape on the right-hand side of the figure will, in many cases, have lower costs than these; however, they are ruled out as the cortex does not traverse the retinas appropriately. Given particular spatial locations of the retinal cells, it is straightforward to calculate the cost per unit length of Z-folds of varying widths; the width that minimizes this is the one both Hebbian and elastic algorithms should find. GW show that the optimal width of a stripe for the basic elastic topology is 2I/d, where I is the separation of the two retinas and d is the distance between two cells within a retina. Simulations verified this, using the elastic net topology 𝒯. Note that increasing I increases |x^L_i − x^R_j|² in the third term of equation 2.4, leaving the other distance terms unaltered.

⁶If the appropriate constraints are satisfied, one of the equivalent equations such as 2.4 can also be used.
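Such a stripe-width scan can be carried out directly. In the sketch below the explicit block construction of the Z-fold map (how cortical cells are assigned to eyes and retinal positions for a stripe of width w) is entirely our own illustrative choice, as are all sizes and parameters; the cost is a quadratic form of the equation-2.4 type with C = (I + ν𝒯)^{−1} and a fixed (toroidal) cortex, so total cost stands in for cost per unit length.

```python
import numpy as np

def zfold_positions(n_ret, w, d, sep):
    """Matched 'positions' p(a) for 2*n_ret cortical cells, stripe width w.
    Each retinal cell of each eye is covered exactly once; sep is the
    inter-retina separation I, d the within-retina cell spacing."""
    pos = []
    for a in range(2 * n_ret):
        block = a // w
        eye = block % 2                    # alternate eyes every w cells
        i = (block // 2) * w + (a % w)     # advance monotonically per eye
        pos.append((i * d, eye * sep))
    return np.array(pos, dtype=float)

def zfold_cost(n_ret=24, w=2, d=1.0, sep=2.0, nu=0.75):
    """E(w) = sum_ab C_ab |p(a) - p(b)|^2 with C = (I + nu*T)^(-1)."""
    m = 2 * n_ret
    idx = np.arange(m)
    T = 2 * np.eye(m)                      # toroidal nearest-neighbor topology
    T[idx, (idx + 1) % m] -= 1.0
    T[idx, (idx - 1) % m] -= 1.0
    C = np.linalg.inv(np.eye(m) + nu * T)
    p = zfold_positions(n_ret, w, d, sep)
    sq = ((p[:, None, :] - p[None, :, :]) ** 2).sum(-1)
    return (C * sq).sum()

# Scan candidate stripe widths (chosen so stripes tile the cortex evenly).
costs = {w: zfold_cost(w=w) for w in (1, 2, 3, 4, 6, 8)}
```

Swapping in a C generated from another S (e.g. equation 3.2 or the MKS form) is a one-line change, which is how the dependence on the cortical length scale can be explored.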
Figure 2: One-dimensional ocular dominance maps. (a) "Z"-fold stripes. (b) Optimal map.

For the MKS topology, the optimal width should increase with D, the length scale of the cortical interaction function. The analysis above suggests that it should also increase with I, given the common energy function. Both of these are demonstrated in Figure 3, which shows how the optimal Z-fold stripe width w varies with both I and D. With the exception of the patch for low I and high D, this is monotonic in both variables. Note that as D increases, the matrix C becomes increasingly singular, which forces implausible constraints on the cortical connectivity. Also small stripes are favored for large D and small I, since they benefit more from the negative contributions from the influences of widely separated cells than they lose through the cost of switching between the retinas. In fact the cost function becomes at least trimodal in the width of the stripes in this regime; one optimum is at the minimum stripe width, another is at the sideways "U" of Figure 2, and the third is at the width that would preserve monotonicity in Figure 3.

4 Discussion
It is natural to wish to incorporate more extensive topologies into the elastic net than the nearest neighbor one with which it is presently endowed. One particular motivation for this comes from the apparent conflict between the predictions of stripe width from the elastic and Hebbian theories of the development of ocular dominance. However, it is also important in other cases, such as graph matching in von der Malsburg's (1981) correlation theory of brain function. In this, fine scale temporal correlations in the firing of cells in a field are determined by the topology of the object being represented on that field, and inference consists of matching
Figure 3: Optimal Z-fold stripe width w versus length scale D of the cortical interaction function and distance I between the retinas. The cost function values E were calculated from equation 2.4, replacing (I + ν𝒯)^{−1} with C generated using equation 3.3.
this graph with an isomorphic one on another field. If the fine scale temporal correlations embody more than nearest neighbor correlations (and inference will be faster if they do), then describing this process in elastic terms will require a more general topology too. This paper has used the formalism of generalized deformable models to consider how general topologies fit into an elastic net framework. It demonstrates that this is effective by showing how the optimal stripe widths theoretically change with changing cortical length scales. However, it does remain to be seen which of the alternatives lead to stable elastic algorithms. Designer topologies are as simple to specify as designer error functions, and it will be interesting to see if there is an equivalent wealth of well-motivated examples.
Acknowledgments
I am very grateful to Geoff Goodhill for introducing me to the problems of ocular dominance and for providing constant encouragement and assistance throughout this study. I also thank David MacKay, Ken Miller, Terry Sejnowski, Martin Simmen, David Willshaw, and Alan Yuille for their help. Support was from the SERC.

References

Durbin, R., and Mitchison, G. 1990. A dimension reduction framework for cortical maps. Nature (London) 343, 644-647.
Durbin, R., Szeliski, R., and Yuille, A. L. 1989. An analysis of the elastic net approach to the traveling salesman problem. Neural Comp. 1, 348-358.
Durbin, R., and Willshaw, D. J. 1987. An analogue approach to the traveling salesman problem using an elastic net method. Nature (London) 326, 689-691.
Goodhill, G. J., and Willshaw, D. J. 1990. Application of the elastic net algorithm to the formation of ocular dominance stripes. Network 1, 41-61.
Hopfield, J. J., and Tank, D. W. 1985. Neural computation of decisions in optimization problems. Biol. Cybern. 52, 141-152.
Miller, K. D., Keller, J. B., and Stryker, M. P. 1989. Ocular dominance column development: Analysis and simulation. Science 245, 605-615.
Simić, P. D. 1990. Statistical mechanics as the underlying theory of "neural" and "elastic" optimizations. Network 1, 89-104.
Simić, P. D. 1991. Constrained nets for graph matching. Neural Comp. 3, 268-281.
von der Malsburg, C. 1981. The Correlation Theory of Brain Function. Internal report 81-2, Max-Planck Institute for Biophysical Chemistry, Göttingen, Germany.
von der Malsburg, C., and Willshaw, D. J. 1977. How to label nerve cells so that they interconnect in an ordered fashion. Proc. Natl. Acad. Sci. U.S.A. 74, 5176-5178.
Willshaw, D. J., and von der Malsburg, C. 1979. A marker induction mechanism for the establishment of ordered neural mappings: Its application to the retinotectal problem. Phil. Trans. R. Soc. B 287, 203-243.
Yuille, A. L. 1990. Generalized deformable templates, statistical physics and matching problems. Neural Comp. 2, 1-24.
Yuille, A. L., Kolodny, J. A., and Lee, C. W. 1991. Dimension reduction, generalized deformable models and the development of ocularity and orientation. Proceedings of the International Joint Conference on Neural Networks, Seattle, WA.

Received 21 January 1992; accepted 24 September 1992.
Communicated by Richard Lippmann
Neural Networks for Fingerprint Recognition

Pierre Baldi
Jet Propulsion Laboratory and Division of Biology, California Institute of Technology, Pasadena, CA 92209 USA

Yves Chauvin
Net-ID, Inc. and Department of Psychology, Stanford University, Stanford, CA 94305 USA
After collecting a data base of fingerprint images, we design a neural network algorithm for fingerprint recognition. When presented with a pair of fingerprint images, the algorithm outputs an estimate of the probability that the two images originate from the same finger. In one experiment, the neural network is trained using a few hundred pairs of images and its performance is subsequently tested using several thousand pairs of images originating from a subset of the data base corresponding to 20 individuals. The error rate currently achieved is less than 0.5%. Additional results, extensions, and possible applications are also briefly discussed.

1 Introduction
The fast, reliable, and computerized classification and matching of fingerprint images is a remarkable problem in pattern recognition that has not, to this date, received a complete solution. Automated fingerprint recognition systems could in principle have an extremely wide range of applications, well beyond the traditional domains of criminal justice, and, for instance, render the use of locks and identification cards obsolete. Our purpose here is to give a brief account of our preliminary results on the application of neural network ideas to the problem of fingerprint matching. In particular, we shall describe the architecture, training, and testing of a neural network algorithm that, when presented with two fingerprint images, outputs a probability p that the two images originate from the same finger.

There are several reasons to suspect that neural network approaches may be remarkably well suited for fingerprint problems. First, fingerprints form a very specific class of patterns with very peculiar flavor and statistical characteristics. Thus the corresponding pattern recognition problems seem well confined and constrained, perhaps even more so than in other pattern recognition problems, such as the recognition

Neural Computation 5, 402-418 (1993) © 1993 Massachusetts Institute of Technology
of handwritten characters, where neural networks have already been applied with reasonable success (see, for instance, Le Cun et al. 1990).

Second, neural networks could avoid some of the pitfalls inherent to other more conventional approaches. It has been known for over a century (see Moenssens 1971 for an interesting summary) that pairs of fingerprint images can be matched by human operators on the basis of minutia and/or ridge orientations. Minutia are particular types of discontinuities in the ridge patterns, such as bifurcations, islands, and endings. There are typically on the order of 50 to 150 minutia (Fig. 2a) on a complete fingerprint image. Ten matching minutia or so are usually estimated as sufficient to reliably establish identity. Indeed, it is this strategy based on minutia detection and matching that has been adopted in most of the previous attempts to find automated solutions. The minutia-based approach has two obvious weaknesses: it is sensitive to noise (especially with inked fingerprints, small perturbations can create artificial minutia or disguise existing ones) and computationally expensive, since it is essentially a graph matching problem.

Third, neural networks are robust, adaptive, and trainable from examples. This is particularly important since fingerprint images can include several different sources of deformation and noise, ranging from the fingers and their positioning on the collection device (translation, roll, rotation, pressure, skin condition) to the collection device itself (ink/optical). Furthermore, it is important to observe that the requirements in terms of speed, computing power, probability of false acceptance and false rejection, memory, and data base size can vary considerably depending on the application considered. To access a private residence or private car, one needs a small economic system with a small modifiable data base of a few people and a response time of at most a few seconds.
On the other hand, forensic applications can require rapid searches through very large data bases of millions of records using large computers, with a response time that can be longer. Neural networks can be tailored and trained differently to fit the particular requirements of specific applications. From a technical standpoint, there are two different problems in the area of fingerprint analysis: classification and matching. The classification of fingerprints into subclasses can be useful to speed up searches through large data bases. It is of interest to ask whether neural networks can be used to implement some of the conventional classification schemes, such as the partition of fingerprint patterns into whorls, arches, and loops ("pattern level classification"), or to create new classification boundaries. Classification problems, however, will be discussed elsewhere. Here, we shall concentrate exclusively on the matching problem. Indeed, at the core of any automated fingerprint system, whether for identification or verification purposes and whether for large or small data base environments, there should be a structure that, when presented with two fingerprint images, decides whether or not they originate from
Pierre Baldi and Yves Chauvin
the same finger. Accordingly, our goal is the design and testing of such a neural algorithm. Because neural networks are essentially adaptive and need to be trained from examples, we next describe our data base of training and testing examples and how it was constructed. We then consider the matching algorithm, which consists of two stages: a preprocessing stage and a decision stage. The preprocessing stage basically aligns the two images and extracts, from each one of them, a central region. The two central regions are fed to the decision stage, which is the proper neural network part of the algorithm and subject to training from examples. Whereas the preprocessing stage is fairly standard, the decision stage is novel and based on a neural network that implements a probabilistic Bayesian approach to the estimate of the probability p of a match. In the main experiment, the network is trained by gradient descent using a training set of 300 pairs of images coming from 5 fingers, 5 images per finger. Its performance is then validated using an additional set of 4650 pairs coming from 15 additional fingers, 5 images per finger also. After training, the network achieves an overall error rate of 0.5%. Additional results and possible extensions are discussed at the end.

2 Data Base
Although there exist worldwide many fingerprint data bases, these are generally not available for public use. In addition, and this is a crucial issue for connectionist approaches, most data bases contain only one image or template per finger whereas training a neural network to recognize fingerprint images requires that several noisy versions of the same record be available for training. Therefore, to train and test a neural network, one must first construct a data base of digitized fingerprint images. Such images can be obtained in a variety of ways, for instance by digital scanner with inked fingerprints or by more sophisticated holographic techniques (Igaki et al. 1990). We decided to build our own collection device, using a simple principle. The device basically consists of a prism placed in front of a CCD (charge coupled device) camera connected to a frame grabber board installed on a personal computer (Fig. 1). When a finger is positioned on the diagonal face of the prism, incoming rays of light from one of the square sides of the prism are refracted differently depending on whether they encounter a point of contact of the finger with the prism (corresponding to a ridge) or a point of noncontact. This creates a pattern of bright and dark ridges in the refracted beam that can easily be focused, with a lens, on the CCD camera and then digitized and stored in the computer. Our resulting images are 512 x 464 pixels in size, with 8 bits gray scale per pixel. On the corresponding scale, the thickness of a ridge corresponds to 6 pixels or so. This is of course not a very economical format for the storage of fingerprint images that contain
Figure 1: Collection device: diffuse light entering the prism is not reflected where the ridges are in contact with the prism. The corresponding pattern of light and dark ridges is focused on a CCD camera, digitized on a personal computer, and sent to a workstation for further processing. a much smaller amount of useful information. Yet, this format is necessary at least in the developing phase, in particular in order to fine tune the preprocessing. In this way, we have assembled a data base of over 200 fingerprint images using various fingers from 30 different persons. To solve the matching problem, it is imperative that the data base contains several different images of the same finger taken at different times. Thus, for what follows, the most important part of the data base consists of a subset of 100 images. These are exclusively index finger images from 20 different persons, 5 different images being taken for each index finger at different times. At each collection time, we did not give any particular instruction to the person regarding the positioning of the finger on the prism other than to do so "in a natural way." In general, we made a deliberate attempt not to try to reduce the noise and variability that would be present in a realistic environment. For instance, we did not clean the surface of the prism after each collection. Indeed, we do observe significant differences among images originating from the same finger. This variability results from multiple sources, mostly at the level of the finger (positioning, pressure, skin condition) and the collection device (brightness, focus, condition of prism surface). We have conducted several learning experiments using this data base, training the networks with image pairs originated from up to 7 persons
and testing the algorithm on the remaining pairs. Here, we shall mostly report the typical results of one of our largest experiments where, out of the (100 x 99)/2 = 4950 image pairs in this data base, (25 x 24)/2 = 300 image pairs originated from 5 different persons are used to train the network by gradient descent. The remaining 4650 pairs of images are used to test the generalization abilities of the algorithm. Given two fingerprint images A and B, the proposition that they match (or do not match) will be denoted by M(A, B) [or M̄(A, B)]. The purpose then is to design a neural network algorithm that, when presented with a pair (A, B) of fingerprint images, outputs a number p = p(M) = p[M(A, B)] between 0 and 1 corresponding to a degree of confidence (Cox 1946) or probability that the two fingerprints match. Here, as in the rest of the paper, we shall tend to omit in our notation the explicit dependence on the pair (A, B) except the first time a new symbol is introduced.
3 Preprocessing Stage
Any algorithm for fingerprint recognition may start with a few stages of standard preprocessing where the raw images may be rotated, translated, scaled, contrast enhanced, segmented, compressed, or convolved with some suitable filter. In our application, the purpose of the preprocessing stage is to extract from each one of the two input images a central patch called the central region and to align the two central regions. Only the two aligned central regions are in turn fed to the decision stage. The preprocessing itself consists of several steps, first to filter out high-frequency noise, then to compensate for translation effects present in the images and to segment them, and finally to align and compress the central regions. For ease of description, one of the input images will be called the reference image and the other one the test image, although there is no intrinsic difference between the two.

3.1 Low Pass Filtering. To get rid of the numerous high-frequency spikes that seem to be present in the original images, we replace every pixel that significantly deviates from the values of its four neighbors by the corresponding average.
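The spike-removal rule above can be sketched in a few lines of numpy. The deviation threshold and the treatment of border pixels are assumptions, since the paper does not specify them:

```python
import numpy as np

def despike(img, thresh=50.0):
    """Replace every pixel that deviates strongly from the mean of its four
    neighbors by that mean (a sketch of the paper's low-pass step; the
    threshold value and untouched 1-pixel border are assumptions)."""
    img = np.asarray(img, dtype=float)
    out = img.copy()
    # mean of the 4-neighborhood for every interior pixel
    nbr = (img[:-2, 1:-1] + img[2:, 1:-1] +
           img[1:-1, :-2] + img[1:-1, 2:]) / 4.0
    center = img[1:-1, 1:-1]
    mask = np.abs(center - nbr) > thresh
    out[1:-1, 1:-1] = np.where(mask, nbr, center)
    return out
```

Applied to an 8-bit gray-scale image, this removes isolated spikes while leaving smooth ridge structure unchanged.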
3.2 Segmentation. For each image, we first draw a tight rectangular box around each fingerprint using an edge detection algorithm and determine the geometric center of the box. The central region of the reference image is then defined to be the 65 x 65 central square patch that occupies the region immediately below the previously described center. For the test image, instead we select a similar but larger patch of size 105 x 105 (extending the previous patch by 20 pixels in each direction). This larger patch is termed the window.
3.3 Alignment. We slide, pixel by pixel, the central region of the reference image across the window of the test image (by 20 pixels up, down, left, and right) and compute at each step the corresponding correlation, until we find the position where the correlation is maximal. This, aside from the training period, is the most computationally expensive part of the entire algorithm. The central region of the test image is then determined by selecting the central 65 x 65 patch corresponding to the position of maximal correlation (Fig. 2b).

3.4 Compression and Normalization. Finally, each one of the two 65 x 65 central regions is reduced to a 32 x 32 array by discrete convolution with a truncated gaussian of size 5 x 5. This 32 x 32 compressed central region contains a low-resolution image, which corresponds roughly to 10 ridges in the original image. The resulting pixel values are conveniently normalized between 0 and 1. In our implementation, all the parameters, and in particular the size of the various rectangular boxes, are adjustable. The values given here are the ones used in the following simulations and empirically seem to yield good results. To avoid border effects, a 2-pixel-wide frame is usually added around the various rectangular boxes, which explains some of the odd sizes.

It is also natural to wonder at this stage whether a decision regarding the matching of the two inputs could not already be taken based solely on the value of the maximal correlation found during the alignment step (3.3) by thresholding it. It is a key empirical observation that this maximal correlation, due in part to noise effects, is not sufficient. In particular, the correlation of both matching and nonmatching fingerprint images is often very high (above 0.9) and we commonly observe cases where the correlation of nonmatching pairs is higher than the correlation of matching pairs. It is therefore essential to have a nonlinear decision stage following the preprocessing.
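A minimal sketch of the alignment step, using the paper's 65-pixel patch and 20-pixel search margin as defaults. The paper does not spell out its correlation measure, so normalized cross-correlation of mean-centered patches is assumed here:

```python
import numpy as np

def align_and_crop(reference, test_img, patch=9, margin=4):
    """Slide the central patch of the reference image over the window of the
    test image and return the test patch with maximal correlation (a sketch
    of step 3.3; the paper uses patch=65, margin=20)."""
    cy, cx = reference.shape[0] // 2, reference.shape[1] // 2
    r0, c0 = cy - patch // 2, cx - patch // 2
    ref = reference[r0:r0 + patch, c0:c0 + patch]
    rc = ref - ref.mean()
    ty = test_img.shape[0] // 2 - patch // 2
    tx = test_img.shape[1] // 2 - patch // 2
    best, best_patch = -np.inf, None
    for dy in range(-margin, margin + 1):
        for dx in range(-margin, margin + 1):
            cand = test_img[ty + dy:ty + dy + patch, tx + dx:tx + dx + patch]
            cc = cand - cand.mean()
            # normalized cross-correlation of the two mean-centered patches
            corr = np.sum(rc * cc) / (np.linalg.norm(rc) * np.linalg.norm(cc) + 1e-12)
            if corr > best:
                best, best_patch = corr, cand
    return ref, best_patch
```

With the paper's sizes (65 and 20), this is a brute-force search over 41 x 41 offsets, which is consistent with the remark that alignment is the most expensive part of the algorithm.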
Finally, it should be noticed that during training as well as testing, the preprocessing needs to be applied only once to each pair of images. In particular, only the central regions need to be cycled through the neural network during the training phase. Although the preprocessing is not subject to training, it can be implemented, for most of its part, in a parallel fashion compatible with a global neural architecture for the entire algorithm. 4 Neural Network Decision Stage
The decision stage is the proper neural network part of the algorithm. As in other related applications (see, for instance, Le Cun et al. 1990), the network has a pyramidal architecture, with the two aligned and compressed central regions as inputs and with a single output p. The bottom level of the pyramid corresponds to a convolution of the central
Figure 2: (a) A typical fingerprint image: the surrounding box is determined using an edge detection algorithm. Notice the numerous minutia and the noise present in the image, for instance in the form of ridge traces left on the prism by previous image collections.
Figure 2 (b) Preprocessing of two images of the same finger: the left image is the reference image, the right image is the test image (same as a). The 65 x 65 central region of the reference image is shown in black right under the geometrical center (white dot). The 105 x 105 window of the test image is shown in black. The white square is the central region of the test image and corresponds to the 65 x 65 patch, within the window, which has a maximal correlation with the central region of the reference image.
regions with several filters or feature detectors. The subsequent layers are novel. The final decision they implement results from a probabilistic Bayesian model for the estimation of p, based on the output of the convolution filters. Both the filtering and decision part of the network are adaptable and trained simultaneously. 4.1 Convolution. The two central regions are first convolved with a set of adjustable filters. In this implementation only two different filter types are used, but the extension to a larger number of them is straightforward. Here, each filter has a 7 x 7 receptive field and the receptive fields of two neighboring filters of the same type have an overlap of 2 pixels to approximate a continuous convolution operation. The output of all the filters of a given type form a 6 x 6 array. Thus each 32 x 32 core is transformed into several 6 x 6 arrays, one for each filter type. The output of filter type j at position (x, y) in one of these arrays is given (for instance for A) by
z^j_{x,y}(A) = f( Σ_{r,s} w^j_{x,y,r,s} I_{r,s}(A) + t^j )        (4.1)

where I_{r,s}(A) is the pixel intensity in the compressed central region of image A at the (r, s) location, f is one of the usual sigmoids [f(x) = (1 + e^{-x})^{-1}], w^j_{x,y,r,s} is the weight of the connection from the (r, s) location in the compressed central region to the (x, y) location in the array of filter outputs, and t^j is a bias. The sum in 4.1 is taken over the 7 x 7 patch corresponding to the receptive field of the filter at the (x, y) location. The threshold and the 7 x 7 pattern of weights are characteristic of the filter type (so-called "weight sharing"), so that they can also be viewed as the parameters of a translation invariant convolution kernel. They are shared within an image but also across the images A and B. Thus w^j_{x,y,r,s} = w^j(x - r, y - s) and, in this implementation, each filter type is characterized by 7 x 7 + 1 = 50 learnable parameters. In what follows, we simplify the notation for the location of the outputs of the filters by letting (x, y) = i. For each filter type j, we can now form an array Δz^j(A, B) consisting of all the squared differences

Δz^j_i(A, B) = [z^j_i(A) - z^j_i(B)]^2        (4.2)

and let Δz = Δz(A, B) denote the array of all Δz^j_i(A, B) for all positions i and filter types j (Fig. 3).

4.2 Decision. The purpose of the decision part of the network is to estimate the probability p = p[M(A, B) | Δz(A, B)] = p(M | Δz) of a match between A and B, given the evidence Δz provided by the convolution filters. The decision part can be viewed as a binary Bayesian classifier. There are four key ingredients to the decision network we are proposing.
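The shared-weight convolution of equation 4.1 and the squared feature differences of equation 4.2 can be sketched as follows. A 7 x 7 kernel applied with stride 5 reproduces the stated 2-pixel overlap between neighboring receptive fields and the 6 x 6 output array; the random weights below stand in for trained ones:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def filter_outputs(region, w, t, stride=5):
    """Shared-weight convolution of eq. 4.1: a 32x32 compressed central
    region convolved with one 7x7 filter (weights w, bias t) at stride 5
    yields a 6x6 array of sigmoid outputs."""
    k = w.shape[0]
    n = (region.shape[0] - k) // stride + 1
    z = np.empty((n, n))
    for x in range(n):
        for y in range(n):
            patch = region[x * stride:x * stride + k, y * stride:y * stride + k]
            z[x, y] = sigmoid(np.sum(w * patch) + t)
    return z

# squared feature differences Δz of eq. 4.2 -- the same filter is shared
# across both images A and B
rng = np.random.default_rng(0)
w, t = rng.normal(0, 0.5, size=(7, 7)), 0.0
A, B = rng.random((32, 32)), rng.random((32, 32))
dz = (filter_outputs(A, w, t) - filter_outputs(B, w, t)) ** 2
```

Note that (32 - 7) / 5 + 1 = 6, matching the 6 x 6 array of filter outputs described in the text.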
Figure 3: Network architecture: at the bottom, reference and test images A and B are presented to the network as two 32 x 32 arrays. The network extracts features from the images by convolution with several different 7 x 7 filter types.

1. Because the output of the network is to be interpreted as a probability, the usual least mean square error used in most backpropagation networks does not seem to be an appropriate measure of network performance. For probability distributions, the cross entropy between the estimated probability output p and the true probability P, summed over
all patterns, is a well-known information theoretic measure of discrepancy (see, for instance, Blahut 1987)

H = Σ [P log(P/p) + Q log(Q/q)]        (4.3)

where, for each image pair, Q = 1 - P and q = 1 - p. H is also known as the discrimination function and can be viewed as the expected value of the log-likelihood ratio of the two distributions. The discrimination is nonnegative, convex in each of its arguments, and equal to zero if and only if its arguments are equal.

2. Using Bayes inversion formula and omitting, for simplicity, the dependence on the pair (A, B),

p(M | Δz) = p(Δz | M) p(M) / [p(Δz | M) p(M) + p(Δz | M̄) p(M̄)]        (4.4)
The effect of the priors p(M) and p(M̄) should be irrelevant in the case of a large set of examples. Our data base is large enough for the decision to be driven only by the data, as confirmed by the simulations (see also Fig. 4). In simulations, the values chosen are typically p(M) = 0.1 and p(M̄) = 0.9 [the observed value of p(M) is roughly 16% in the training set and 4% in the entire data base].

3. We make the simplifying independence assumption that

p(Δz | M) = Π_{i,j} p(Δz^j_i | M)        (4.5)

p(Δz | M̄) = Π_{i,j} p(Δz^j_i | M̄)        (4.6)
Strictly speaking, this is not true, especially for neighboring locations. However, in the center of a fingerprint, where there is more variability than in the periphery, it is a reasonable approximation, which leads to simple network architectures and, with hindsight, yields excellent results.

4. To completely define our network, we still need to choose a model for the conditional distributions p(Δz^j_i | M) and p(Δz^j_i | M̄). In the case of a match, the probability p(Δz^j_i | M) should be a decreasing function of Δz^j_i. It is therefore natural to propose an exponential model of the form

p(Δz^j_i | M) = C_{ij} s_{ij}^{Δz^j_i}

where 0 < s_{ij} < 1 and, for proper normalization, the constant C_{ij} must take the value C_{ij} = log s_{ij} / (s_{ij} - 1). In what follows, however, we use a less general but slightly simpler binomial model of the form

p(Δz^j_i | M) = p_j^{1 - Δz^j_i} (1 - p_j)^{Δz^j_i}        (4.7)

p(Δz^j_i | M̄) = q_j^{Δz^j_i} (1 - q_j)^{1 - Δz^j_i}        (4.8)
Figure 4: Neural decision network. This is just a neural network implementation of equations 4.4-4.8. Except for the output unit, each unit computes its output by applying its transfer function to the weighted sum of its inputs. In this network, different units have different transfer functions, including: id(x) = x, σ(x) = (1 + e^{-x})^{-1}, log x, and exp(x) = e^x. The output unit is a normalization unit that calculates p(M | Δz) in the form of a quotient x/(x + y). The coefficients p_j and q_j of 4.7 and 4.8 are implemented in the form p_j = σ(π_j) and 1 - q_j = σ(θ_j). π_j and θ_j are the only adjustable weights of this part of the network and they are shared with the connections originating from the convolution filter outputs. All other connection weights are fixed to 1 except for the connections running from log p_j to the exponential unit, which have a weight equal to 36, the size of the array of filter outputs. The exponential unit on the left, for instance, computes p(Δz | M) p(M) = p(M) Π_{i,j} p_j [(1 - p_j)/p_j]^{Δz^j_i}, using the fact that (1 - p_j)/p_j = e^{-π_j}. Notice that the priors play the role of a bias for the exponential units, a bias that, after training, ends up having little influence on the output.
with 0.5 ≤ p_j, q_j ≤ 1. This is again only an approximation, used for its simplicity and for the fact that the feature differences Δz^j_i are usually close to 0 or 1. In this implementation and for economy of parameters, the adjustable parameters p_j and q_j depend only on the filter type, another example of weight sharing. In a more general setting, they could also depend on location. In summary, the adjustable parameters of the neural network are w^j_{x,y,r,s}, t^j, p_j, and q_j. In this implementation, their total number is (7 x 7 + 1) x 2 + 2 x 2 = 104. At first sight, this may seem too large in light of the fact that the network is trained using only 300 pairs of images. In reality, each one of the 50 shared parameters corresponding to the weights and bias of each one of the convolution filters is trained on a much larger set of examples since, for each pair of images, the same filter is exposed to 72 different subregions. The parameters are initialized randomly (for instance, the w^j_{x,y,r,s} are all drawn independently from a gaussian distribution with 0 mean and standard deviation 0.5). They are then iteratively adjusted, after each example pair presentation, by gradient descent on the cross entropy error measure H. The specific formulas for adjusting each one of them can readily be derived from 4.3-4.8 and will not be given here for brevity. It should also be noticed that the adaptable pattern classifier defined by 4.3-4.8 is not a neural network in the most restrictive sense of a layered system of sigmoid units. It is rather a nonlinear model with adjustable parameters that can be drawn (Fig. 4), in several different ways, in a neural network fashion. The number of units and their types, however, depends on how one decides to decompose the algebraic steps involved in the computation of the final output p.
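Equations 4.3-4.8 combine into a short decision routine. The sketch below assumes the binomial model above with illustrative (untrained) parameter values, and works in log space to avoid underflow:

```python
import numpy as np

def match_probability(dz, p, q, prior_match=0.1):
    """Estimate p(M | Δz) by Bayes inversion (eq. 4.4) under the
    independence assumption (eqs. 4.5-4.6) and the binomial model
    (eqs. 4.7-4.8). dz maps each filter type j to its 6x6 array of squared
    feature differences; p[j] and q[j] lie in [0.5, 1]. A sketch, not the
    trained network; the parameter values used below are illustrative."""
    log_m = np.log(prior_match)        # accumulates log p(Δz|M) p(M)
    log_n = np.log(1.0 - prior_match)  # accumulates log p(Δz|non-M) p(non-M)
    for j, d in dz.items():
        log_m += np.sum((1 - d) * np.log(p[j]) + d * np.log(1 - p[j]))
        log_n += np.sum(d * np.log(q[j]) + (1 - d) * np.log(1 - q[j]))
    # stable normalization: exp(log_m) / (exp(log_m) + exp(log_n))
    return np.exp(log_m - np.logaddexp(log_m, log_n))

def cross_entropy(P, p_hat):
    """Per-pair discrepancy of eq. 4.3 between target P and estimate p_hat."""
    Q, q_hat = 1.0 - P, 1.0 - p_hat
    h = 0.0
    if P > 0:
        h += P * np.log(P / p_hat)
    if Q > 0:
        h += Q * np.log(Q / q_hat)
    return h
```

With identical feature vectors (all Δz^j_i = 0) the estimate is driven close to 1, and with maximally different features (all Δz^j_i = 1) close to 0, regardless of the modest prior, illustrating the remark that the priors have little influence on the output.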
5 Results and Conclusions
We have trained the network described in the previous sections using 300 pairs of images from our data base and only two different filter types (Fig. 5). The network performance is then tested on 4650 new pairs. The network usually learns the training data base perfectly. This is not the case, however, when only one filter type is used. The separation obtained in the output unit between matching and nonmatching pairs over the entire population is good since 99% of the matching (or nonmatching) pairs yield an output above 0.8 (or below 0.2). The error rate on the generalization set is typically 0.5% with roughly half the errors due to false rejections and half to false acceptances. In many applications, these two types of error do not play symmetric roles. It is often the case, for instance, that a false acceptance is more catastrophic than a false rejection. If, by changing our decision threshold on p , we enforce a 0% rate of false acceptances, the rate of false rejections increases to 4%. This error rate
Figure 5: Unit activation throughout the network when two matching fingerprint images are presented as inputs. Flow of activation is left to right. Images A and B are presented to the network as 32 x 32 arrays. The network has two filters, 1 and 2, each represented by the corresponding 7 x 7 pattern of weights. Each filter is convolved with each array and generates the set of feature arrays 1A, 1B, 2A, and 2B. The next layer computes the squared feature differences for each filter. Finally, the similarity S = p(M | Δz) is computed. In this example the similarity is close to 1 (represented by a black vertical bar), close to the target value T = 1. Notice the essentially binary values assumed by the features and the compact representation of the input images with 72 bits each.

needs of course to be reduced, but even so it could be almost acceptable for certain applications. As in other related applications, the interpretation of the filter types discovered by the network during learning is not always straightforward (Figs. 5 and 6). We have conducted various experiments with up to four filter types but on smaller data bases. Usually, at least one of the filter types always appears to be an edge or a ridge orientation detector. Some of the other filter types found in the course of various experiments may be interpretable in terms of minutia detectors, although this is probably more debatable. On the completion of the training phase, the outputs of the filters in the decision stage are close to being binary, 0 or 1. Since the final decision of the network is entirely based on these outputs, they provide a very compressed representation of all the relevant matching information originally contained in the 512 x 464 x 8 bit images. Thus, in this implementation, each image is roughly reduced to 36 x 2 = 72 bits, which is within a factor of two of a rough estimate of the theoretical lower bound (the number of human beings, past and present, is approximately 2^33 ≈ 8.5 x 10^9).

Figure 6: Examples of filters learned in different experiments: one filter is represented by the corresponding 7 x 7 pattern of weights. The size of each square represents the value of the corresponding weight. Black and white squares correspond to positive and negative weights, respectively. Whereas the first filter seems to be an edge or ridge orientation detector, the other two are more difficult to interpret. It may be tempting to describe them in terms of minutia detectors, such as an ending and a bifurcation detector, but this may not necessarily be the case.

In applications, the algorithm would not be used in the same way as it has been during the training phase. In particular, only the central regions of the reference images need to be stored in the data base. Since the forward propagation through the decision stage of the algorithm is
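The factor-of-two claim above is easy to verify with a line of arithmetic:

```python
import math

# Rough capacity check of the compressed representation: with about
# 2^33 ≈ 8.5e9 humans, roughly 33 bits is a lower bound for a unique
# identity code, and the 72-bit representation (36 filter outputs x 2
# filter types) is within roughly a factor of two of that bound.
lower_bound_bits = math.log2(8.5e9)
compressed_bits = 36 * 2
ratio = compressed_bits / lower_bound_bits
```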
very fast, one can in fact envision a variation on the algorithm where the alignment step in the preprocessing is modified and where the reference image is stored in the data base only in the most compressed form given by the corresponding outputs of the filters in the decision stage. In this variation, all the possible 65 x 65 square patches contained in the 105 x 105 window of the test image would be sent, after the usual compression and normalization preprocessing steps, through the convolution filter arrays and then matched through the neural network with the corresponding outputs for the reference image. The final decision would then be based on an examination of the resulting surface of p values. Whether this algorithm also leads to better decisions needs further investigation.

To reduce the error rate, several things can be tried. One possibility is to use more general exponential models in 4.7 and 4.8 rather than binomial distributions. Alternatively, the number of filter types or the number of free parameters could be increased (for instance, by letting p_j and q_j depend also on location), as well as the size of the training and validation sets. Another possibility is to use larger windows in the comparison or, for instance, two small windows rather than one, the second window being automatically aligned by the alignment of the first one. A significant fraction of the residual false rejections we observe seems to be due to rotation effects, that is, to the fact that fingers are sometimes positioned at different angles on the collection device. The network we have described seems to be able to deal with rotations up to a 10 degree angle. Larger rotations could easily be dealt with in the preprocessing stage. It is also possible to incorporate a guiding device into the collection system so as to entirely avoid rotation problems.
In this study, we have attempted to find a general purpose neural network matcher that could be able, in principle, to solve the matching problem over the entire population. In this regard, it is remarkable that the network, having been trained with image pairs associated with only five different persons, generalizes well to a larger data base associated with 20 persons. Obviously, a general purpose network needs also to be tested on a larger sample of the population. Unlike the classification problem, however, the matching problem should be much less sensitive to the size of the training and testing sets. In classifying whorls, for instance, it is essential to expose the network to a large sample representative of whorl patterns across the entire population, with all their subtle statistical variations. In our matcher, we are basically subtracting one image from the other, and therefore only the variability of the difference really matters. Specific applications, especially those involving a small data base, have particular characteristics that may be advantageously exploited both in the architecture and the training of networks, and that also raise some particular issues. For a car lock application, for instance, positive matches occur only with a pair of fingerprints associated with a small data base of only a few persons. Positive matches corresponding to fingerprints
associated with persons outside the data base are irrelevant. They may not be needed for training. Conceivably, in small data base applications, a different network could be trained to recognize each record in the data base separately against possible imposters. Finally, the approach we have described, and especially the Bayesian decision stage with its probabilistic interpretation, is not particular to the problem of fingerprint recognition. It is rather a general framework that could be applied to other pattern matching problems where identity or homology needs to be established. The corresponding neural networks can easily be embodied in hardware, especially once the learning has been done off-line. As already pointed out, most of the steps in the preprocessing and the decision stages are in fact convolutions and are amenable to parallel implementation. On a workstation, it currently takes on the order of 10 sec to completely process a pair of images. This time could be reduced by a few orders of magnitude with specially dedicated hardware.

References

Blahut, R. E. 1987. Principles and Practice of Information Theory. Addison-Wesley, Reading, MA.
Cox, R. T. 1946. Probability, frequency and reasonable expectation. Am. J. Phys. 14(1), 1-13.
Igaki, S., Eguchi, S., and Shinzaki, T. 1990. Holographic fingerprint sensor. Fujitsu Tech. J. 25(4), 287-296.
Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1990. Handwritten digit recognition with a backpropagation network. In Neural Information Processing Systems, Vol. 2, pp. 396-404. Morgan Kaufmann, San Mateo, CA.
Moenssens, A. A. 1971. Fingerprint Techniques. Chilton Book Company, Radnor, PA.

Received 28 April 1992; accepted 6 October 1992.
Communicated by Yann Le Cun
Centered-Object Integrated Segmentation and Recognition of Overlapping Handprinted Characters

Gale L. Martin
MCC, Austin, TX 78759 USA

Visual object recognition is often conceived of as a final step in a visual processing system. First, physical information in the raw image is used to isolate and enhance to-be-recognized clumps, and then each of the resulting preprocessed representations is fed into the recognizer. This general conception fails when there are no reliable physical cues for isolating the objects, such as when objects overlap. This paper describes an approach, called centered object integrated segmentation and recognition (COISR), for integrating object segmentation and recognition within a single neural network. The application is handprinted character recognition. The approach uses a backpropagation network that scans a field of characters and is trained to recognize whether it is centered over a single character or between characters. When it is centered over a character, the net classifies the character. The approach is tested on a dataset of handprinted digits and high accuracy rates are reported.

1 Introduction

Most character recognition systems fail when characters touch each other or when an individual character is broken up by intervening white space. The systems use intervening white space to segment a character string into individual characters, so that classification can be done one character at a time. As a preprocessing step prior to recognition, segmentation simplifies the recognition task. It restricts the number of input features to those associated with a single character and makes it possible to filter out irrelevant variations by normalizing for factors such as size and skew on a character by character basis. The problem with this separation of segmentation and recognition is that it fails in the instances noted above, where human vision does not.
For people, segmentation and recognition seem to be integrated, interdependent processes. This paper describes test results of an approach, called centered object integrated segmentation and recognition (COISR), for integrating character segmentation and recognition within one neural network (Martin 1990). The approach builds on previous work in presegmented character recognition (Le Cun et al. 1990; Martin and Pittman 1990), and on the sliding

Neural Computation 5, 419-429 (1993) © 1993 Massachusetts Institute of Technology
420
Gale L. Martin
Figure 1: The COISR approach uses a multilayered net trained through backpropagation to classify what is centered in its input window. The input window scans over a field of characters that has been size normalized with respect to the vertical height of the field. The output layer consists of one node for each of the characters (0-9) and one node for a no-centered-character state. The latter state occurs whenever the scanning window is centered between characters or between a character and the edge of the field. To identify the characters in the field the output node activations are evaluated over time. When the NOTCENTERED node value is less than a threshold, the values for each character node are summed over time until the NOT-CENTERED node value exceeds the threshold. The maximum of these sums determines the classificationjudgment for that character. window conception used in neural network speech applications, such as NETtalk (Sejnowski and Rosenberg 1986) and Time Delay Neural Networks (Waibel el al. 1988). It is assumed that segmenting a character, in the sense of separating it from its background, is not a necessary precursor to recognizing the character, or to learning to recognize the character. The system is trained to recognize what is centered in its input window, and the window size is chosen to be large enough to maintain the natural contexts of the character-the surrounding characters, which may touch or overlap with the centered character. As shown in Figure 1, the net is trained on an input window, and a target vector representing what is in the center of the window. The window scans along the field, so that sometimes the middle of the window is centered on a character, and sometimes it is centered on a point
between two characters. The target output vector consists of one node per category and one node corresponding to a NOT-CENTERED condition. This latter node has a high target activation value when the input window is not centered over any character. There is no need to explicitly segment characters or to train the net on the exact physical extent of the characters, because recognition is defined as identifying what is in the center of the scanning window. That is, there is no need to pretrain the network on individual presegmented characters removed from their natural context of surrounding letters. The net learns to extract regularities in the shapes of individual characters even when those regularities occur in the context of overlapping and broken characters at both training and testing stages. The final stage of processing includes parsing the temporal stream produced by the computed output vectors of the net as it scans the field, to identify the highest peaks of activation, which correspond to the characters in the field. The COISR approach was tested using the NIST database of handprinted digit fields.

2 Test of the Approach
2.1 Image Database. The NIST (National Institute of Standards and Technology) database contains 273,000 samples of handwritten numerals collected from Bureau of the Census field staff. Each of 2100 census workers filled in a form with 33 fields, 28 of which contain only handwritten digits. The present study used fields 6-30 of the form, which correspond to five different fields of length 2, 3, 4, 5, or 6 digits each. The training data included roughly 80,000 digits (800 forms, 20,000 fields), and came from forms labeled f0000-f0499 and f1500-f1799 in the dataset. The test data consisted of roughly 20,000 digits (200 forms, 5000 fields) and came from forms labeled f1800-f1899 and f2000-f2099 in the dataset. The samples were scanned at a 300 pixel/inch resolution.

2.2 Image Preprocessing. Each original field image depicts a region that contains a small amount of white space surrounding a box, in which the digit field was written. The relative heights of the digits with respect to the box differ across fields. To minimize input size, each field image is preprocessed to eliminate the white space surrounding the box, the box itself, and the white space surrounding the digit field. Each field is then size normalized with respect to the vertical height of the digit field. Experimentation with different heights revealed that 20 pixels is sufficient to enable resolution of the digits by a human observer. Since the input is size normalized to the vertical height of the field of characters, the actual number of characters in the constant-width input window varies depending on the relative height-to-width ratio of each character. The visual context provided by the surrounding characters may help in disambiguating confusions in cases where there is sufficient redundancy
in the sequence of characters (Martin 1990). An input pattern generator is then passed over the field, creating input windows 36 pixels wide and 20 pixels high, at 3-pixel increments across the field. Thus, the number of input patterns created for a field is greater than the number of digits in the field.

2.3 Position Labeling. A key design principle of the present approach is that highly accurate integrated segmentation and recognition requires training on both the shapes of characters and their positions within the input window. The field images used for training were labeled with the horizontal center positions of each character in the field. The human labeler simply pointed at the horizontal center of each digit in sequence with a mouse cursor and clicked on a mouse button. The horizontal position of each character was then paired with its category label (0-9) in a data file. This labeling process is quite efficient and considerably less time consuming than the standard segmentation method of drawing a box around each character. The process is not unlike a human reading teacher using a pointer to indicate the position of each character as he or she reads aloud the sequence of characters making up the word or sentence. During testing this position information is not used.

2.4 Target Outputs. The position information about the locations of character centers is used to generate target output values during the generation of input patterns for training. When the center position of a window is within plus or minus 2 pixels of the center of a character, the target value of that character's output node is set at the maximum (0.8 during initial training and 1.0 during subsequent training), with the target value of the NOT-CENTERED node set at the minimum (0.2 during initial training and 0.0 during subsequent training). The activation values of all other characters' output nodes are set at the minimum.
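The distance-based target scheme of Section 2.4 (a plateau within 2 pixels of a character center, the reverse situation near the midpoint between centers, and a linear ramp between the two, as described here and continued below) can be sketched as follows. Function and parameter names are illustrative, and only windows between a character and the next character are shown; the other side is symmetric.

```python
# Sketch of the trapezoidal target function for one character's output
# node. Uses the 0.8/0.2 extremes and the +/-2-pixel plateau half-width
# given in the text; all names are illustrative, not from the paper.

def char_target(window_center, char_center, next_center,
                on=0.8, off=0.2, plateau=2.0):
    """Target for a character's output node given the window's
    horizontal center (all positions in pixels)."""
    midpoint = (char_center + next_center) / 2.0
    d = abs(window_center - char_center)
    if d <= plateau:
        return on                      # window centered over the character
    if d >= abs(midpoint - char_center) - plateau:
        return off                     # window near the midpoint
    ramp_len = abs(midpoint - char_center) - 2 * plateau
    frac = (d - plateau) / ramp_len
    return on - frac * (on - off)      # linear ramp between the plateaus

# The NOT-CENTERED node gets the complementary value, e.g.:
#   nc = on + off - char_target(...)
```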
When the center position of a window is within plus or minus 2 pixels of the halfway point between two character centers, the reverse situation holds. The target values of all character output nodes are set to the minimum and the target value of the NOT-CENTERED node is set to the maximum. Between these two extremes, the target values vary linearly with distance, creating a trapezoidal function.

2.5 Network Architecture. The neural network is a 2-hidden-layer backpropagation network, with local, shared connections in the first hidden layer, and local connections in the second hidden layer (see Fig. 2). This architecture was chosen to minimize memory requirements and the time required to process each pattern, while retaining sufficient capacity for the net to learn the training patterns. Several attempts to train nets with shared weights in both layers, and thereby significantly reduce memory requirements, yielded poor training performance. Presumably, this is because position information is necessary to classify what is centered in the input window.

Figure 2: The architecture for the centered-object integrated segmentation and recognition net has two hidden layers.

The input window to the network is 36 pixels wide by 20 pixels high, with pixels taking on grayscale values ranging from 0 to 1. The first hidden layer consists of 2016 nodes, or more specifically 18 independent groups of 112 (16 x 7) nodes, with each group having local, shared connections to the input layer. It can be visualized as a 16 x 7 x 18 cube of nodes. The local, shared connections within a group ensure that the same feature map develops for all nodes in that group. Node bias values within a group are shared as well. Each node receives input from a local 6 x 8-pixel region. These local, overlapping regions are offset by 2 pixels, such that the regions covered by each group of nodes span the input layer. The second hidden layer consists of 180 nodes, having local, but not shared, connections. This layer can be visualized as a 6 x 3 x 10 cube. Each node in this layer receives input from a local 6 x 3 x 18-node region on the first hidden layer. These local, overlapping regions are offset by 2 pixels. The output layer consists of 11 nodes, with each of these nodes connected to all of the nodes in the second hidden layer. The net has a total of 2927 nodes (including input and output nodes) and 157,068 connections. In a feedforward (nonlearning) mode on a DEC 5000 workstation, in which the net is scanning a field of digits, the system processes about two digits per second. This figure includes the image preprocessing, as well as the number of feedforward passes necessary to recognize the digits. The actual number of forward passes required to recognize a given field of characters depends on the width of the characters, the distance between characters, and the scan increment. As an example, at the present scan increment of 3 pixels, a typical field of 6 digits required 33 forward passes of the net, or between 5 and 6 forward passes per recognized character.

2.6 Training Parameters. During initial training, the target output values were set at 0.8 (on) and 0.2 (off), with these changing to 1.0 and 0.0, respectively, later on. The learning rate was set at 0.05 during initial training, with this value changing to 0.01 when a plateau in training performance was reached. The momentum term was set at 0.9 throughout training. Training and testing alternated, with the highest testing performance reported. The training was done in a randomly permuted order, such that the order of training images is quite different than the sequential scans used during testing.
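The node and connection totals quoted in Section 2.5 can be rechecked from the stated layer sizes. (Weights are shared within a first-hidden-layer group, but each node still has its own fan-in of connections, which is what the connection count tallies.)

```python
# Recomputing the COISR net's node and connection counts from the
# layer sizes given in the text.

input_nodes = 36 * 20                  # 36 x 20 grayscale input window
h1_nodes = 18 * (16 * 7)               # 18 groups of 16 x 7 nodes
h2_nodes = 6 * 3 * 10                  # 6 x 3 x 10 cube
output_nodes = 11                      # digits 0-9 plus NOT-CENTERED

h1_connections = h1_nodes * (6 * 8)          # local 6 x 8-pixel fan-in
h2_connections = h2_nodes * (6 * 3 * 18)     # local 6 x 3 x 18-node fan-in
out_connections = output_nodes * h2_nodes    # fully connected

total_nodes = input_nodes + h1_nodes + h2_nodes + output_nodes
total_connections = h1_connections + h2_connections + out_connections
print(total_nodes, total_connections)  # 2927 157068
```

Both totals agree with the figures stated in the text.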
2.7 Output Parsing. As the net scans horizontally, the activation values of the 11 output nodes create a trace as shown in Figure 1. To convert this to an ASCII string corresponding to the digits in the field, the trace is parsed as follows. The state of the NOT-CENTERED node is monitored continuously. When its activation value falls below a threshold (0.4), a summing process begins for each of the other nodes, which ends when the activation value of the NOT-CENTERED node exceeds the threshold. The activation values of each of the other nodes are multiplied by 1 minus the activation value of the NOT-CENTERED node, and added to a running total that is accumulated for each node. When the activation value of the NOT-CENTERED node then exceeds the threshold, this is interpreted as the input window having moved off of a character, and the system classifies the character on the basis of which output node has the highest running total, or peak of activation, at this place in the field. Across the characters in the field thus classified, the system also orders the corresponding peaks of activation. In applications involving digit fields, it is often the case that there is a priori knowledge of how many digits are in the field. When the parsing process indicates that too many digits have been identified (i.e., an insertion error has occurred), the system takes the n highest peaks, where n is the number of expected digits in the field. When the parsing process indicates that too few characters have been identified, the field is automatically rejected. The results reported below were obtained using this method of reducing the number of insertion and deletion errors.

3 Results

Figure 3: Test error and reject rates for fields of lengths 2 through 6 digits. The test set consists of about 20,000 characters and 5000 fields. The digits within the lines correspond to the number of characters in the field.
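The output-parsing procedure of Section 2.7 can be sketched as below. The data layout and names are illustrative: `trace` stands for the net's sequence of 11-element output vectors over scan positions, with index 10 playing the role of the NOT-CENTERED node, and 0.4 is the threshold given in the text.

```python
# Sketch of the Section 2.7 parsing procedure. `trace` is a list of
# 11-element activation vectors, one per scan position; index 10 is
# the NOT-CENTERED node. Names and layout are illustrative.

NOT_CENTERED = 10
THRESHOLD = 0.4

def parse_trace(trace):
    """Return (digits, peaks): the classified digits and their
    running-total peak values, in scan order."""
    digits, peaks = [], []
    totals = None
    for out in trace:
        nc = out[NOT_CENTERED]
        if nc < THRESHOLD:
            if totals is None:          # window moved onto a character
                totals = [0.0] * 10
            for i in range(10):         # weight by (1 - NOT-CENTERED)
                totals[i] += out[i] * (1.0 - nc)
        elif totals is not None:        # window moved off the character
            best = max(range(10), key=lambda i: totals[i])
            digits.append(best)
            peaks.append(totals[best])
            totals = None
    if totals is not None:              # flush a character at field end
        best = max(range(10), key=lambda i: totals[i])
        digits.append(best)
        peaks.append(totals[best])
    return digits, peaks
```

With a priori knowledge of the field length n, the n largest peaks would be kept when too many digits are found, and the field rejected when too few are found, as described in the text.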
As shown in Figure 3, the COISR technique achieves low field-based error rates. The error rates are field based in the sense that if the network
misclassifies one character in the field, the entire field is considered misclassified. Rejections are based on placing a threshold on the acceptable distance between the highest and the next-highest running activation total. In this way, by varying the threshold, the error rate can be traded off against the percentage of rejections. Since the reported data apply to fields, the threshold applies to the smallest distance value found across all of the characters in the field.

Figure 4: Examples from the test set of handwritten digit fields containing touching, overlapping, and broken characters that the COISR net correctly recognizes.

Figure 4 provides examples, from the test set, of fields that the COISR network correctly classifies. The examples illustrate several common problems that plague conventional character segmentation and recognition systems and yet are dealt with successfully by the present system. These include fields containing multiple, contiguous touching characters (e.g., 95304, 03283, 8201), pairs of characters that touch at multiple points (e.g., 05, 556, 399, 1509, 09), image noise that places lines through multiple characters in a field (e.g., 500), and broken characters (e.g., 6354, 157, 641, 7963). Broken characters can be troublesome to conventional character segmentation and recognition systems for at least two reasons. Characters that are randomly broken (e.g., 7963), where this is primarily caused by image degradation or poor capture, can be difficult to recognize by systems that use a separate feature extraction stage prior to classification. Some of the features searched for are defined by small dot or stroke-like segments surrounded by white space, and hence can be mistakenly classified with broken characters. The other problem is caused by the previously mentioned, separate segmentation stage. One
type of preprocessor looks for columns of white space, or low points in the density histogram, to use as the basis for separating characters. Using this strategy, it seems quite difficult to come up with a reliable, accurate segmenter that works for both broken characters (e.g., 6354, 157, 641) and nonbroken, close, or touching characters (e.g., 92, 43862, 389). The COISR technique works around these problems because the net is trained on image input and does not require the separate, preceding segmentation stage.

4 Discussion
The COISR technique makes it possible to do something that conventional character recognition systems cannot do: robustly recognize character fields containing touching, overlapping, and broken characters. Conventional character recognition systems require separate segmentation that precedes classification in both training and testing. Since such segmentation is based only on physical indicators (e.g., density histograms), and usually is independent of the recognition stage, it fails in cases where the appropriate segmentation depends on the classification judgment. Conventional systems can be altered to achieve integrated segmentation and recognition in limited cases that involve handcrafting and a significant amount of iterative processing (Fenrich 1991; Shridhar and Badreldin 1986, 1987). One approach is to use multiple schemes for segmentation during training and testing, thereby creating multiple possible segmentations. These include histogramming techniques, searching for vertical concavities that might signal a valid breakpoint between characters, and using rules that can effectively reconstruct commonly broken characters. Classification is performed for each such possible segmentation. This is integrated segmentation and recognition in the sense that final segmentation and classification judgments are interdependent. However, the approach presumably breaks down as the number of possible segmentations increases, as would occur, for example, if individual characters are broken or touching in multiple places, multiple letters in a sequence are connected, or the size of the vocabulary and the variability of the writing styles increase. The COISR approach does not have this weakness because the net learns to classify images of characters in their natural states, be they broken, connected, or unconnected, and because the NOT-CENTERED output node receives its input from the same hidden nodes that develop to classify characters.
The COISR approach has similarities and differences with respect to another backpropagation-based integrated segmentation and recognition approach initially proposed by Rumelhart (1989) and then developed jointly at MCC by Keeler and Rumelhart (Keeler et al. 1991; Keeler and Rumelhart 1992). Keeler et al. refer to their approach as self-organizing integrated segmentation and recognition (SOISR). Neither the COISR nor the SOISR technique requires a separate segmentation stage during testing, and both achieve integrated segmentation and recognition by first convolving a simple variant of a backpropagation network over a character field, and then parsing the resulting activation peaks to interpret the output of the convolutions. One implementation-level difference between the two is that SOISR performs the convolution in parallel, rather than over time as is the case with COISR. The primary difference between the two lies in what information about position or physical extent is required for training. To achieve their reported accuracy rates, Keeler et al. pretrain on characters that have first been presegmented; a person essentially draws a box around each character and the net learns to recognize these isolated characters. In subsequent training, a net with these pretrained weights replicated across the extent of a fixed-width input field is further trained on examples of fields that can contain connecting or broken characters. No position information, in the sense of the character-center positions used by COISR, is required with the SOISR approach. The COISR approach, on the other hand, does not require pretraining on isolated characters to achieve the reported accuracy rates. An approach very similar to the SOISR system has recently been developed by Matan and his colleagues (Matan et al. 1992). A weakness of the COISR approach is that it performs an essentially exhaustive scan over the to-be-classified input field. This means that the components needed to recognize a character in a given location must be replicated across the length (and height, if two dimensions are used) of the to-be-classified input field, at the degree of resolution necessary to recognize the smallest and closest characters.
Since completion of the present work, we have developed a similar system, modeled loosely after human eye movements, that is trained to jump over blank areas within a field (Martin and Rashid 1992; Martin et al. 1993). This reduces the number of forward passes required per character to about 1.5.

Acknowledgments

I thank Lori Barski, John Canfield, David Chapman, Roger Gaborski, Jay Pittman, Mosfeq Rashid, and Dave Rumelhart for helpful discussions and/or development of supporting image handling and network software. I also thank Jonathan Martin for help with the position labeling.

References

Fenrich, R. 1991. Segmentation of automatically located handwritten words. Paper presented at the International Workshop on Frontiers in Handwriting Recognition, Chateau de Bonas, France, 23-27 September.
Keeler, J., and Rumelhart, D. E. 1992. A self-organizing integrated segmentation and recognition neural network. In Advances in Neural Information Processing Systems, Vol. 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds. Morgan Kaufmann, San Mateo, CA.
Keeler, J. D., Rumelhart, D. E., and Leow, W.-K. 1991. Integrated segmentation and recognition of handprinted numerals. In Advances in Neural Information Processing Systems, Vol. 3, J. E. Moody and D. S. Touretzky, eds., pp. 557-563. Morgan Kaufmann, San Mateo, CA.
Le Cun, Y., Boser, B., Denker, J., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. 1990. Handwritten digit recognition with a backpropagation network. In Advances in Neural Information Processing Systems, Vol. 2, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Martin, G. L. 1990. Integrating segmentation and recognition stages for overlapping handprinted characters. MCC Tech. Rep. ACT-"-320-90.
Martin, G. L., and Rashid, M. 1992. Recognizing overlapping hand-printed characters by centered-object integrated segmentation and recognition. In Advances in Neural Information Processing Systems, Vol. 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds. Morgan Kaufmann, San Mateo, CA.
Martin, G. L., Rashid, M., and Pittman, J. 1993. Integrated segmentation and recognition through exhaustive scans or learned saccadic jumps. International Journal of Pattern Recognition and Artificial Intelligence (in press).
Martin, G. L., and Pittman, J. A. 1990. Recognizing hand-printed letters and digits. In Advances in Neural Information Processing Systems, Vol. 2, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Matan, O., Burges, C. J. C., Le Cun, Y., and Denker, J. S. 1992. Multi-digit recognition using a space displacement neural network. In Advances in Neural Information Processing Systems, Vol. 4, J. E. Moody, S. J. Hanson, and R. P. Lippmann, eds. Morgan Kaufmann, San Mateo, CA.
Rumelhart, D. 1989.
Learning and generalization in multilayer networks. Presentation given at the NATO Advanced Research Workshop on Neuro Computing: Algorithms, Architectures and Applications, Les Arcs, France, February.
Sejnowski, T. J., and Rosenberg, C. R. 1986. NETtalk: A parallel network that learns to read aloud. Johns Hopkins University Electrical Engineering and Computer Science Tech. Rep. JHU/EECS-86/01.
Shridhar, M., and Badreldin, A. 1986. Recognition of isolated and simply connected handwritten numerals. Pattern Recog. 19, 1-12.
Shridhar, M., and Badreldin, A. 1987. Context-directed segmentation algorithm for handwritten numeral strings. Image Vision Comput. 5, 3-9.
Waibel, A., Sawai, H., and Shikano, K. 1988. Modularity and scaling in large phonemic neural networks. ATR Interpreting Telephony Research Laboratories Tech. Rep. TR-1-0034.

Received 7 November 1991; accepted 19 October 1992.
Communicated by Steven Zucker
Surface Interpolation Networks

Alex P. Pentland, Perceptual Computing Group, The Media Laboratory, Massachusetts Institute of Technology, Room E15-387, 20 Ames Street, Cambridge, MA 02139 USA
Orthogonal wavelets can be used as models for receptive fields in the human visual system. They may also be used to solve spatial interpolation problems formulated either as regularization or 2-D Kalman filtering. The solutions take the form of simple feedback networks, and only a few iterations are required for convergence.

1 Introduction

There exist families of interpolation problems in vision. Most familiar are the interpolation problems of stereopsis and contrast. Stereopsis provides sparse measurements of disparity, which are filled in to produce the percept of a continuous surface. Similarly, the visual system's initial filters respond only to contrast changes at moving edges; these contrast changes are filled in to produce the perceived pattern of lightness. Perhaps the best-known interpolation theory in computational vision is regularization (Poggio et al. 1985; Terzopoulos 1988). Using this approach, optimal RMS estimates of the surface can be obtained under the assumption that the surface can be characterized as a stationary Markov process. However, this theory has the drawback that the interpolation network requires hundreds or even thousands of iterations to produce a smoothly interpolated surface. Thus there is a need for a more efficient theory of interpolation; in addition, there is a desire for more biologically plausible ones. Interpolation methods such as regularization are applicable only to stationary signals, that is, single images or static environments. To integrate information across multiple views in a nonstationary environment requires the techniques of optimal estimation using dynamic models. The Kalman filter, perhaps the most common optimal estimation technique, produces optimal RMS estimates of the surface for both stationary processes and nonstationary processes with linear dynamics. In computer vision Kalman filter systems have been built (Matthies et al. 1989), and have been shown to be more accurate and robust than single-image techniques such as regularization. However, these algorithms have been complex, computationally expensive, and not plausible as biological models.

Neural Computation 5, 430-442 (1993) © 1993 Massachusetts Institute of Technology
In this paper I will show how efficient, biologically plausible solutions to these interpolation problems can be obtained by using networks with orthogonal wavelet receptive fields. Numerical examples using natural imagery will be shown. An implementation in C code is available by anonymous FTP from whitechapel.media.mit.edu in the file /u/ftp/misc/wavelet.reg.tar.Z.
1.1 Background: Regularization. Surface interpolation typically involves constructing a piecewise-smooth surface given a sparse set of noisy contrast or disparity measurements. Because the measurements are sparse, the problem is ill-posed and requires adding a smoothing or regularizing term to obtain a solution in areas away from the measured disparities or contrasts. Mathematically, the static interpolation problem is formulated as finding a suitable function U that minimizes a smooth energy functional E(U) given the sparse measurement data D. By taking the variational derivative δE of the energy functional and discretizing over a lattice of n nodes, the following matrix equation is obtained (Poggio et al. 1985; Terzopoulos 1989):

λKU + SU - D = 0    (1.1)
In this equation U is an n x 1 vector of unknown displacements for each of the n nodes, K is an n x n matrix called the regularizing or smoothness matrix, D is an n x 1 vector whose nonzero entries are the measured sensor data, S is a diagonal "selection matrix" with ones at nodes with sensor measurements, and λ is a scalar constant that balances the relative influence of the data and regularization terms. An interpolated surface U that solves equation 1.1 can be obtained by iterating a two-layer network with center-surround receptive fields (Terzopoulos 1989). Unfortunately, several thousand iterations are typically required to obtain an interpolated surface; even if complex multiresolution techniques are employed, several hundred iterations are still required.

2 Orthogonal Wavelets
The amount of computation required for surface interpolation is proportional to both the bandwidth and condition number of K. Both can be greatly reduced by transforming the problem to another basis or coordinate system. In neural systems, such a transformation can be accomplished by passing incoming disparity or contrast measurements through a set of receptive fields; the shapes of the receptive fields are the new basis vectors, and the resulting neural activities are the coordinates of the measurement data in the coordinate system defined by these basis vectors. If the receptive fields are orthonormal, then we can convert back
Figure 1: Left column: The orthogonal wavelet filter family used in this paper (filters have been arbitrarily scaled for display). Right column: The power spectra of these filters on a linear scale.
to the original coordinate system by adding up the same receptive fields in amounts proportional to the associated neuron's activity. A class of bases that greatly improve the bandwidth and condition number of K are generated by functions known as orthogonal wavelets. A family of orthogonal wavelets h_{a,b} is constructed from a single function h by dilation by a and translation by b (Daubechies 1988; Mallat 1989; Simoncelli and Adelson 1990):

h_{a,b}(x) = |a|^{-1/2} h((x - b)/a),  a ≠ 0    (2.1)

Typically a = 2^j and b = 1, . . . , n/2^j for j = 1, 2, 3, . . . . In an orthogonal wavelet family all of the wavelets, including all translations, sizes, and orientations, are orthogonal to one another, and thus form a multiscale orthonormal basis. I will call such a basis Φ_w, where the columns of the n x n matrix Φ_w are the basis vectors. Because Φ_w forms an orthonormal basis, it (like the Fourier transform) is self-inverting, for example, Φ_w^T Φ_w = Φ_w Φ_w^T = I. Consequently the computational properties of orthogonal wavelets are very different from, for example, Gabor or derivative-of-gaussian wavelets, even though they may appear visually similar. The wavelet basis Φ_w used in this paper was "learned" using a gradient-descent procedure (Simoncelli and Adelson 1990), starting with difference-of-gaussians as the initial "guess" at an orthogonal basis. The left-hand column of Figure 1 shows a subset of Φ_w; from top to bottom are the basis vectors corresponding to a = 1, 2, 4, 8, 16 and b = n/2, together with their Fourier power spectra. As can be seen, these wavelets are similar to those known to exist in human vision; for instance, there is only a 7.5% MSE difference between these wavelets and those of the Wilson-Gelb model of human spatial vision (Pentland 1991; Wilson and Gelb 1984). On digital computers transformation to the wavelet coordinate system is normally computed recursively using separable filters (Simoncelli and Adelson 1990). Because the process is recursive it requires only O(n) operations, and is thus among the most efficient of all orthogonal transforms. The wavelet transform of a 128 x 128 pixel image requires only 2.0 sec on a Sun 4/330.

3 Surface Interpolation Using Wavelet Bases
It has been proven that by using orthogonal wavelet bases smooth linear operators can be represented extremely compactly (Albert et al. 1990). This suggests that awis an effective preconditioning transform, and thus may be used to obtain very fast approximate solutions. The simplest method is to transform a previously defined K to the wavelet basis, discard off-diagonal elements,
\Omega_w^2 = \mathrm{diag}\left( \Phi_w^T K \Phi_w \right) \qquad (3.1)

and then solve. Note that for each choice of K the diagonal matrix Ω_w² is calculated only once and then stored; further, this calculation requires only O(n) operations. In numerical experiments I have found that for a typical K the summed magnitude of the off-diagonals of K is approximately 5% of the diagonal's magnitude, so that we expect to incur only small errors by discarding off-diagonals.

Alex P. Pentland
434

Case 1. The simplest case of surface interpolation is when sensor measurements exist for every node, so that the sampling matrix S = I. Substituting Φ_w Ũ = U and premultiplying by Φ_w^T converts equation 1.1 to

\lambda \Phi_w^T K \Phi_w \tilde{U} + \Phi_w^T \Phi_w \tilde{U} = \Phi_w^T D \qquad (3.2)

By employing equation 3.1, we then obtain

(\lambda \Omega_w^2 + I) \tilde{U} = \Phi_w^T D \qquad (3.3)

The approximate interpolation solution U is therefore

U = \Phi_w (\lambda \Omega_w^2 + I)^{-1} \Phi_w^T D \qquad (3.4)
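As a toy numerical check of equations 3.1-3.4, the following sketch uses my own illustrative choices: an orthonormal Haar basis standing in for the learned Φ_w, a 1-D first-difference smoothness matrix standing in for K, and equation 1.1 read as (λK + S)U = D with S = I.

```python
import numpy as np

def haar_matrix(n):
    """Orthonormal Haar wavelet basis for R^n (n a power of 2); rows are basis vectors."""
    if n == 1:
        return np.array([[1.0]])
    h = haar_matrix(n // 2)
    top = np.kron(h, [1.0, 1.0])                 # coarser-scale vectors
    bot = np.kron(np.eye(n // 2), [1.0, -1.0])   # finest-scale wavelets
    m = np.vstack([top, bot])
    return m / np.linalg.norm(m, axis=1, keepdims=True)

n = 16
Phi = haar_matrix(n).T                            # columns of Phi are the basis vectors
L = np.diff(np.eye(n), axis=0)                    # first-difference operator
K = L.T @ L                                       # a simple 1-D "membrane" smoothness matrix
lam = 1.0

rng = np.random.default_rng(0)
D = np.sin(2 * np.pi * np.arange(n) / n) + 0.1 * rng.standard_normal(n)

Omega2 = np.diag(Phi.T @ K @ Phi)                 # eq. 3.1: keep only the diagonal
U_approx = Phi @ ((Phi.T @ D) / (lam * Omega2 + 1.0))   # eq. 3.4 (S = I case)
U_exact = np.linalg.solve(lam * K + np.eye(n), D)       # direct solve for comparison
print(np.linalg.norm(U_approx - U_exact) / np.linalg.norm(U_exact))
```

Because Ω_w² is diagonal, the approximate solve of equation 3.4 amounts to two wavelet transforms and a per-coefficient division; the printed number is the relative error introduced by discarding the off-diagonals.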
Note that this computation is accomplished by simply transforming the input data D to the wavelet basis, multiplying the filter output appropriately at each level of recursion, and then transforming back to the original coordinate system. To obtain an approximate regularized solution for a √n × √n image using a wavelet of width w therefore requires approximately 8wn + n add and multiply operations.

Case 2. In the more usual case where not all nodes have sensor measurements, the interpolation solution may require iteration. In this case the sampling matrix S is diagonal, with ones for nodes that have sensor measurements and zeros elsewhere. Again substituting Φ_w Ũ = U and premultiplying by Φ_w^T converts equation 1.1 to

\lambda \Phi_w^T K \Phi_w \tilde{U} + \Phi_w^T S \Phi_w \tilde{U} = \Phi_w^T D \qquad (3.5)
The matrix Φ_w^T S Φ_w is diagonally dominant, so that the interpolation solution Ũ may be obtained by iterating

\tilde{U}^{t+1} = (\lambda \Omega_w^2 + \tilde{S})^{-1} \Phi_w^T D^t + \tilde{U}^t \qquad (3.6)
where S̃ = diag(Φ_w^T S Φ_w) and D^t = D − (λK + S)U^t is the residual at iteration t. Normally no more than three to five iterations of equation 3.6 are required to obtain an accurate estimate of the interpolated surface; often a single iteration will suffice. Note that for this procedure to succeed, the largest gaps in the data sampling must be significantly smaller than the largest filters in the wavelet transform. Further, when λ is small and the data sampling is sparse and irregular, it can happen that the off-diagonal terms of Φ_w^T S Φ_w introduce significant error. Therefore when using a small λ it is best to perform one initial iteration with a large λ, and then reduce λ to the desired value in further iterations.

Discontinuities. The matrix K describes the connectivity between adjacent points on a continuous surface; thus whenever a discontinuity
Surface Interpolation Networks
435
occurs K must be altered. Following Terzopoulos (1988), we can accomplish this by disabling receptive fields that cross discontinuities. In a computer implementation, the simplest method is to locally halt the recursive construction of the wavelet transform whenever one of the resulting bases would cross a discontinuity.

An Example. Figure 2a shows synthetic disparity measurements input to a 64 × 64 node interpolation problem (zero-valued nodes have no data); the vertical axis is disparity. These data were generated using a sparse (10%) random sampling of the function z = 100[sin(kx) + sin(ky)]. Figure 2b shows the resulting interpolated surface. In this example equation 3.6 converged to within 1% of its true equilibrium state with a single iteration. Execution time was approximately 1 sec on a Sun 4/330.

4 Surface Interpolation in Nonstationary Environments
The technique of regularization is applicable only to stationary signals, that is, single images or static environments. To integrate information across multiple views in a nonstationary environment, the techniques of optimal estimation and dynamic systems must be used to smooth the interpolated surface over time as well as space (Friedland 1986; Matthies et al. 1989). The simplest such technique is known as the Kalman filter; for problems with linear dynamics it produces an optimal RMS error estimate across time as well as space. The key idea of the Kalman filter is that to obtain optimal estimates we must take into account the environment's dynamics. We will model these dynamics as a linear differential equation,

\frac{d}{dt} X = A X + B a \qquad (4.1)

where X is a vector of unknowns, A and B are matrices, and a is a white noise process. The sensor observations will be modeled as a linear function of the unknowns and a second white noise process n,

Y = C X + n \qquad (4.2)

Then the optimal RMS estimate X̂ of X is given by the following continuous Kalman filter

\frac{d}{dt} \hat{X} = A \hat{X} + K (Y - C \hat{X}) \qquad (4.3)

where the Kalman gain matrix K depends on the matrices A, B, and C, and on the noises a and n. Equation 4.3 provides a robust, optimal method of estimating the unknown state variables of a nonstationary linear process. However, interpolating disparity or contrast in this manner is typically very expensive,
Figure 2: A typical disparity interpolation problem. (a) Disparity data input to a 64 × 64 node interpolation problem; vertical axis is disparity. (b) Interpolated surface after one iteration (approximately 1 sec on a Sun 4/330).
Figure 3: (a) A scene, (b) inverse distances (disparities) for this scene, (c) the autocorrelation function for this scene, (d) the autocorrelation function after transformation to the wavelet coordinate system.

because correlations in the data cause the matrices N, A, and K to have many nonzero bands. Thus it is desirable to transform the problem to a basis in which the input data are decorrelated. In such a basis the matrices will be nearly diagonal, so that the interpolated surface can be obtained efficiently. Figure 3 illustrates the performance of the wavelet transform at decorrelating image structure. Figure 3a shows a simple scene, and Figure 3b shows the inverse distances (disparities) for the same scene, obtained as described in Pentland (1987). Figure 3c shows one row from the autocorrelation matrix computed from these disparities. This autocorrelation is similar to that of a second-order Gauss-Markov process; note that significant correlations exist across more than 64 pixels distance. Figure 3d shows the same autocorrelation function after transformation to the wavelet coordinate system. As can be seen, in the wavelet coordinate system the autocorrelation matrix is very nearly diagonal, reducing the bandwidth of the matrices N, A, and K.

4.1 A Simple Example. As an example I will show how to construct a simple Kalman filter for disparity U and its time derivative U̇ at each
point in a √n × √n image. In this simple example we will assume that we have noisy disparity measurements available at every image point; the sparse-data case is similar. In state-space notation our system of equations is

\frac{d}{dt} \begin{bmatrix} U \\ \dot{U} \end{bmatrix} = \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix} \begin{bmatrix} U \\ \dot{U} \end{bmatrix} + \begin{bmatrix} 0 \\ I \end{bmatrix} a \qquad (4.4)

where U and U̇ are n × 1 vectors, I is the n × n identity matrix, and a is an n × 1 noise vector due to the unknown accelerations of the observed points. The observed variables will be the estimated point-by-point disparities U_o:

U_o = U + n \qquad (4.5)

where n is an n × 1 vector of observation noise. The Kalman filter is therefore

\frac{d}{dt} \begin{bmatrix} \hat{U} \\ \dot{\hat{U}} \end{bmatrix} = \begin{bmatrix} \dot{\hat{U}} \\ 0 \end{bmatrix} + \begin{bmatrix} K_1 \\ K_2 \end{bmatrix} (U_o - \hat{U}) \qquad (4.6)
where K₁ and K₂ are the n × n Kalman gain matrices for velocity and acceleration, respectively. We will assume that n and a originate from independent second-order Gauss-Markov noise processes. As illustrated in Figure 3, such processes are nearly completely decorrelated in a wavelet-defined coordinate system, so that their autocorrelation matrices become nearly diagonal. Thus we have that

\Phi_w^T E[n n^T] \Phi_w \approx N, \qquad \Phi_w^T E[a a^T] \Phi_w \approx A \qquad (4.7)

where N and A are diagonal matrices. Given this approximation to N and A we may then determine the Kalman gain matrices in the standard way (Friedland 1986), which are

K_1 = \Phi_w \sqrt{2}\,(A N^{-1})^{1/4} \Phi_w^T, \qquad K_2 = \Phi_w (A N^{-1})^{1/2} \Phi_w^T \qquad (4.8)

Substituting this result into equation 4.6 we obtain

\frac{d}{dt} \begin{bmatrix} \hat{U} \\ \dot{\hat{U}} \end{bmatrix} = \begin{bmatrix} \dot{\hat{U}} \\ 0 \end{bmatrix} + \begin{bmatrix} \Phi_w \sqrt{2}\,(A N^{-1})^{1/4} \Phi_w^T \\ \Phi_w (A N^{-1})^{1/2} \Phi_w^T \end{bmatrix} (U_o - \hat{U}) \qquad (4.9)
Letting Ũ = Φ_w^T U and premultiplying by Φ_w^T, we obtain

\frac{d}{dt} \begin{bmatrix} \hat{\tilde{U}} \\ \dot{\hat{\tilde{U}}} \end{bmatrix} = \begin{bmatrix} \dot{\hat{\tilde{U}}} \\ 0 \end{bmatrix} + \begin{bmatrix} \sqrt{2}\,(A N^{-1})^{1/4} \\ (A N^{-1})^{1/2} \end{bmatrix} (\tilde{U}_o - \hat{\tilde{U}}) \qquad (4.10)

as Φ_w^T Φ_w = I. As can be seen, the unknowns Ũ are functions only of the measurements Ũ_o and the diagonal matrices A and N. Consequently, in the wavelet coordinate system the Kalman filter equations are decoupled into n independent two-variable Kalman filters. The major consequence of this decoupling is that only O(n) computations and O(n) storage locations are required. Even in the variable-noise case (not discussed here due to space limitations) only O(n) computations are required, as even space-varying Markov N are approximately diagonal in the wavelet coordinate system.

Results. Figure 4a shows the sixth frame from a synthetic image sequence of Yosemite Valley, as seen from the vantage point of a small plane flying down the center of the valley floor. A sequence of corresponding disparity images, the sixth of which is shown in Figure 4b, was generated by reprojecting a digital terrain map of the area from the same set of viewpoints. These 128 × 128 disparity images were then corrupted by the addition of uniformly distributed correlated noise (Power(w) = w^{-0.5}, where w is spatial frequency), resulting in a sequence of disparity images with a signal-to-noise ratio of 1:1, as is illustrated by Figure 4c. Accurate estimates of disparity cannot be obtained by averaging successive frames of the noisy range images, because of the curved camera path and perspective distortion. Similarly, regularization of individual frames does not produce accurate estimates, due to the large amount of correlated noise. These corrupted disparity estimates were fed into the Kalman filter described above. The computation required approximately 5 sec per frame on a Sun 4/330. The disparity estimates at frame six are shown in Figure 4d. Comparing these estimates to the true disparities, shown in Figure 4b, it can be seen that a good interpolation was obtained.
At frame six the mean per-pixel error in the disparity estimate was 8.5% of the initial error (an improvement of 21 dB), demonstrating the stability and efficiency of this formulation.

5 Possible Biological Implementations
Figure 5a illustrates the process of surface interpolation using wavelet receptive fields. The input data D are passed through a layer of neurons with receptive fields, such as are shown in Figure 1, at each location. This computes Φ_w^T D, the wavelet transform of D. The activity of each neuron
Figure 4 (a) The sixth frame of a fly-through of Yosemite valley, (b) true disparities for this image, (c) true disparity plus additive correlated noise with a signal-to-noise ratio of 1:1, (d) Kalman filter estimates of disparity.
is then scaled by a factor dependent upon its central frequency, thus computing (λΩ_w² + I)^{-1} Φ_w^T D. Finally, each neuron's output is summed with a spatial distribution equal to its receptive field, thus computing the inverse wavelet transform and obtaining the interpolated surface U = Φ_w(λΩ_w² + I)^{-1} Φ_w^T D. Note that this figure illustrates the case where S = I; more generally a second such mechanism is required in order to also compute S̃ = diag(Φ_w^T S Φ_w). Figure 5b illustrates one way this computation can be mapped onto neurons. In this figure the input layer arborizes very locally, with the pyramidal cell's basal dendrites producing receptive fields shaped as in Figure 1. This transforms the input to the wavelet basis. The sensitivity of these neurons is presumed inversely proportional to their central frequency. The output axons then arborize with the spatial distribution
Figure 5: (a) Interpolation via regularization, (b) one possible neural implementation. (c) Interpolation via Kalman filtering, (d) one possible neural implementation.
of Figure 1 among the apical dendrites of a second layer of neurons, producing receptive fields similar to those of the basal dendrites. This produces the inverse wavelet transform, so that the output of the second layer of neurons is an interpolated surface. It is important to note that the required orthogonal receptive field structure can be learned from an initial, unspecific center-surround receptive field. In fact, the wavelets shown in Figure 1 were derived in this manner, using a technique similar to that of Kohonen (1982) or Linsker (1986). The Kalman filter interpolation network is nearly identical to the regularization network. The only difference is that a simple neuron-by-neuron temporal feedback loop is required to implement the computation of equation 4.10, as illustrated in Figure 5c. This feedback loop combines the current excitatory input to each neuron with its activity at the previous instant in time, thus smoothing the neuron's output over time. One way of implementing this feedback loop is with recurrent axons, as shown in Figure 5d. Note that the recurrent axons must also arborize with the spatial distribution of Figure 1.
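In discrete time, this per-neuron feedback loop amounts to blending each neuron's current excitatory input with its activity at the previous instant. A minimal sketch with a fixed blending gain k, which I choose here purely for illustration (in the full filter the gains would instead follow from the noise statistics A and N):

```python
import numpy as np

def temporal_feedback(inputs, k=0.3):
    """Smooth a neuron's drive over time: each step blends the new
    excitatory input with the activity at the previous instant."""
    activity = 0.0
    trace = []
    for u in inputs:
        activity = activity + k * (u - activity)  # recurrent feedback update
        trace.append(activity)
    return np.array(trace)

# Noisy, roughly constant input: the smoothed activity settles near the true level.
rng = np.random.default_rng(1)
obs = 1.0 + 0.5 * rng.standard_normal(200)
out = temporal_feedback(obs)
print(out[-1])
```

With a constant but noisy input stream, the recursively smoothed activity converges toward the underlying input level while suppressing frame-to-frame noise, which is the qualitative behavior the recurrent-axon loop needs to provide.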
6 Summary
I have described two interpolation methods that use orthogonal wavelets to efficiently obtain good surface interpolations. These methods have a simple biological implementation, and use wavelets that are similar to those found in human spatial vision. The first network performed surface interpolation via regularization, and is applicable to single images or static environments. For changing environments, it is necessary to employ optimal estimation techniques, for example, Kalman filtering, to integrate information across time. A network that accomplishes surface interpolation via Kalman filters was also described; it is the same as the regularization network except for the addition of a feedback loop.

References

Albert, B., Beylkin, G., Coifman, R., and Rokhlin, V. 1990. Wavelets for the Fast Solution of Second-Kind Integral Equations. Yale Research Report DCS/RR-837, December.
Daubechies, I. 1988. Orthonormal bases of compactly supported wavelets. Commun. Pure Appl. Math. XLI, 909-996.
Friedland, B. 1986. Control System Design. McGraw-Hill, New York.
Kohonen, T. 1982. Self-organized formation of topologically correct feature maps. Biol. Cybern. 43, 59-69.
Linsker, R. 1986. From basic network principles to neural architecture. Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512, 8390-8394, 8779-8783.
Mallat, S. G. 1989. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. PAMI 11(7), 674-693.
Matthies, L., Kanade, T., and Szeliski, R. 1989. Kalman filter-based algorithms for estimating depth from image sequences. Int. J. Computer Vision 3, 209-236.
Pentland, A. P. 1987. A new sense for depth of field. IEEE Trans. PAMI 9(4), 523-531.
Pentland, A. 1991. Cue integration and surface completion. Invest. Ophthalmol. Visual Sci. 32(4), 1197.
Poggio, T., Torre, V., and Koch, C. 1985. Computational vision and regularization theory. Nature (London) 317, 314-319.
Simoncelli, E., and Adelson, E. 1990. Non-separable extensions of quadrature mirror filters to multiple dimensions. Proc. IEEE 78(4), 652-664.
Terzopoulos, D. 1988. The computation of visible surface representations. IEEE Trans. PAMI 10(4), 417-439.
Wilson, H., and Gelb, G. 1984. Modified line-element theory for spatial-frequency and width discrimination. J. Opt. Soc. Am. A 1(1), 124-131.

Received 8 July 1991; accepted 4 November 1992.
Communicated by Steven J. Nowlan
Combining Exploratory Projection Pursuit and Projection Pursuit Regression with Application to Neural Networks

Nathan Intrator*
Institute for Brain and Neural Systems, Brown University, Box 1843, Providence, RI 02912 USA
We present a novel classification and regression method that combines exploratory projection pursuit (unsupervised training) with projection pursuit regression (supervised training), to yield a new family of cost/complexity penalty terms. Some improved generalization properties are demonstrated on real-world problems.

1 Introduction
Parameter estimation becomes difficult in high-dimensional spaces due to the increasing sparseness of the data. Therefore, when a low-dimensional representation is embedded in the data, dimensionality reduction methods become useful. One such method, projection pursuit regression (PPR) (Friedman and Stuetzle 1981), is capable of performing dimensionality reduction by composition, namely, it constructs an approximation to the desired response function using a composition of lower dimensional smooth functions. These functions depend on low-dimensional projections through the data. When the dimensionality of the problem is in the thousands, even projection pursuit methods are almost always overparameterized; therefore, additional smoothing is needed for low variance estimation. Exploratory projection pursuit (EPP) (Friedman and Tukey 1974; Friedman 1987) may be useful in these cases. It searches in a high-dimensional space for structure in the form of (semi)linear projections with constraints characterized by a projection index. The projection index may be considered as a universal prior for a large class of problems, or may be tailored to a specific problem based on prior knowledge. In this paper, the general form of exploratory projection pursuit is formulated to be an additional constraint for projection pursuit regression. In particular, a hybrid combination of supervised and unsupervised artificial neural networks (ANN) is described as a special case. In addition, a specific projection index that is particularly useful for classification (Intrator 1990; Intrator and Cooper 1992) is introduced in this context.

*Present address: Computer Science Department, Tel-Aviv University, Ramat-Aviv, 69978 Israel.

Neural Computation 5, 443-455 (1993) © 1993 Massachusetts Institute of Technology
Nathan Intrator
444
There have been many other attempts to combine unsupervised with supervised learning (Yamac 1969; Gutfinger and Sklansky 1991; Bridle and MacKay 1992). The formulation discussed below is based on projection pursuit ideas that generalize many of the classical statistical methods and, in our case, suggest a well-defined statistical framework that allows formulation and comparison of these methods.

2 Brief Description of Projection Pursuit Regression
Let (X, Y) be a pair of random variables, X ∈ R^d, and Y ∈ R. The problem is to approximate the d-dimensional surface

f(x) = E[Y \mid X = x]

from n observations (x_1, y_1), \ldots, (x_n, y_n). PPR tries to approximate a function f by a sum of ridge functions (functions that are constant along lines)

f(x) \approx \sum_{j=1}^{m} g_j(a_j^T x)

The fitting procedure alternates between an estimation of a direction a_j and an estimation of a smooth function g_j, such that at iteration j the squared average of the residuals

r_{ij} = r_{i,j-1} - g_j(a_j^T x_i)

is minimized. This process is initialized by setting r_{i0} = y_i. Usually, the initial values of a_j are taken to be the first few principal components of the data. Estimation of the ridge functions can be achieved by various nonparametric smoothing techniques such as locally linear functions (Friedman and Stuetzle 1981), k-nearest neighbors (Hall 1989b), splines, or variable degree polynomials. The smoothness constraint imposed on g implies that the actual projection pursuit is achieved by minimizing, at iteration j, the sum

\sum_{i=1}^{n} r_{ij}^2 + C(g_j)

for some smoothness measure C. Due to the fact that the estimation of the nonparametric ridge functions is not decoupled from the estimation of the projections, overfitting is very likely to occur in one of the low-order g_j, thereby invalidating subsequent estimations. Obviously, if g is not well estimated, the search for the optimal projection direction will not yield good results.
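A toy sketch of this stagewise procedure (not the original algorithm: I pick directions from a random grid rather than optimizing over a_j, and use a cubic polynomial fit as the smoother g_j):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]            # target with two ridge components

def fit_ridge_term(X, r, n_dirs=50, deg=3):
    """One PPR iteration: pick the unit direction whose projection,
    smoothed by a low-degree polynomial, best fits the residuals r."""
    best = None
    for _ in range(n_dirs):
        a = rng.standard_normal(X.shape[1])
        a /= np.linalg.norm(a)
        t = X @ a
        coef = np.polyfit(t, r, deg)           # crude stand-in for the smoother g_j
        pred = np.polyval(coef, t)
        sse = np.sum((r - pred) ** 2)
        if best is None or sse < best[0]:
            best = (sse, a, coef)
    return best

r = y.copy()                                   # r_i0 = y_i
for j in range(2):                             # fit two ridge functions
    sse, a, coef = fit_ridge_term(X, r)
    r = r - np.polyval(coef, X @ a)            # residual update r_ij = r_i,j-1 - g_j(a_j^T x_i)

frac = np.var(r) / np.var(y)
print(frac)                                    # fraction of variance left unexplained
```

Each pass fits one ridge function to the current residuals and subtracts it, so the residual variance can only shrink; real PPR replaces the random direction grid with an optimization over a_j and the polynomial with a nonparametric smoother.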
Combining EPP and PPR
445
Several alternatives have been considered in addressing this problem:

- Choose the ridge functions {g_j} from a very small family of functions, for example, sigmoidals with a variable threshold. This eliminates the need to estimate the nonparametric ridge function, but increases the complexity of the architecture. This approach is widely used in artificial neural networks, and may partially explain their success.

- Estimate a fixed number of ridge functions and projections concurrently (as opposed to sequential estimation), provided that the ridge functions are taken from a very limited set of functions. Again this is used in the context of neural networks, due to the relatively small additional computational burden.

Additionally, one may attempt to

- Partially decouple the estimation of the response function, or the estimation of each of the ridge regression functions, from the estimation of the projections.
Ultimately, it is reasonable to combine all of the above. One such implementation is presented in the following sections. First, the issue of decoupling the estimation of the ridge functions from the estimation of the projections is discussed.

3 Estimating the Projections Using Exploratory Projection Pursuit
Exploratory projection pursuit is based on seeking interesting projections of high-dimensional data points (Switzer 1970; Kruskal 1969, 1972; Friedman and Tukey 1974; Friedman 1987; Jones and Sibson 1987; Hall 1988; Huber 1985, for review). The notion of interesting projections is motivated by an observation that for most high-dimensional data clouds, most low-dimensional projections are approximately normal (Diaconis and Freedman 1984). This finding suggests that the important information in the data is conveyed in those directions whose single dimensional projected distribution is far from gaussian. Various projection indices (measures for the goodness of a projection) differ on the assumptions about the nature of deviation from normality, and in their computational efficiency. They can be considered as different priors motivated by specific assumptions on the underlying model. To partially decouple the search for a projection vector from the search for a nonparametric ridge function, we propose to add a penalty term, which is based on a projection index, to the energy minimization associated with the estimation of the ridge functions and the projections. Specifically, let ρ(a) be a projection index that is minimized for projections with a certain deviation from normality. At the jth iteration, we minimize the sum

\sum_{i=1}^{n} r_{ij}^2 + C(g_j) + \rho(a_j)
When a concurrent minimization over several projections/functions is practical, we get a penalty term of the form

\sum_{j=1}^{l} \left[ C(g_j) + \rho(a_j) \right]

Since C and ρ may not be linear, the more general measure that does not assume a stepwise approach, but instead seeks l projections and ridge functions concurrently, is given by

\sum_{i=1}^{n} \left[ y_i - \sum_{j=1}^{l} g_j(a_j^T x_i) \right]^2 + C(g_1, \ldots, g_l) + \rho(a_1, \ldots, a_l)

In practice, ρ depends implicitly on the training data (the empirical density) and is therefore replaced by its empirical measure ρ̂.
3.1 Some Possible Measures. Some applicable projection indices have been discussed (Huber 1985; Jones and Sibson 1987; Friedman 1987; Hall 1989a; Intrator 1990). Probably all the possible measures should emphasize some form of deviation from normality, but the specific type may depend on the problem at hand. For example, a measure based on the Karhunen-Loève expansion (Mougeot et al. 1991) may be useful for image compression with autoassociative networks, since in this case one is interested in minimizing the L2 norm of the distance between the reconstructed image and the original one, and under mild conditions the Karhunen-Loève expansion gives the optimal solution. A different type of prior knowledge is required for classification problems. The underlying assumption then is that the data are clustered (when projecting in the right directions) and that the classification may be achieved by some (nonlinear) mapping of these clusters. In such a case, the projection index should emphasize multimodality as a specific deviation from normality. A projection index that emphasizes multimodality in the projected distribution (without relying on the class labels) has recently been introduced (Intrator 1990) and implemented efficiently using a variant of a biologically motivated unsupervised network (Intrator and Cooper 1992). Its integration into a backpropagation classifier will be discussed below.

4 A Variant of Projection Pursuit Regression: Backpropagation
Network

In this section, we consider a parametric approach, the backpropagation network, as a variant of PPR. In this context the addition of an exploratory projection index is discussed.
Backpropagation (Werbos 1974; Le Cun 1985; Rumelhart et al. 1986) has been chosen as a possible representative of the first two alternatives presented in Section 2, since it has become a useful tool for solving complicated pattern recognition tasks such as speech recognition (Lippmann 1989), and since the class of functions that can be approximated by a backpropagation type network is very large. This architecture (with an unlimited number of projections) can uniformly approximate arbitrary continuous functions on compact sets (Cybenko 1989; Hornik et al. 1989) as well as their derivatives (Hornik et al. 1990), and do so efficiently. Related results can be found in Carroll and Dickinson (1989), Funahashi (1989), Hecht-Nielsen (1989), Hornik (1991), and Ito (1991). In this method, the error is efficiently propagated backward to the previous layer for modification of its synaptic weights (projections). The single hidden layer architecture is of the form

f(x) = \sum_{j=1}^{m} \beta_j \, \sigma(a_j^T x + \theta_j)

where σ is an arbitrary (fixed) bounded monotone function. The form

f(x) = \sigma\!\left( \sum_{j=1}^{m} \beta_j \, \sigma(a_j^T x + \theta_j) \right)

is more suitable for classification tasks. Since this method can approximate any continuous function, great care should be taken so that the variance of the estimator is not large, namely, that the model does not "overfit" the training data (Wahba 1990; Geman et al. 1992, for discussion). This can be done using some form of complexity regularization (Barron and Barron 1988; Barron 1989; White 1990; Moody 1991) or by weight elimination penalties that aim to reduce the effective number of parameters in the model (Plaut et al. 1986; Mozer and Smolensky 1989; Le Cun et al. 1990; Weigend et al. 1991). The performance of the network is measured using a loss criterion, for example, the mean squared error between the output of the network and the target (the class label). The estimation of the weights is done by minimizing the empirical average of the error via gradient descent of the form \partial w_{ij} / \partial t = -\epsilon \, \partial E / \partial w_{ij}, where E = E_x[\varepsilon(x, w)] is the average contribution to the loss criterion of each of the random inputs x.

4.1 Adding EPP Constraints to Backpropagation Network. One way of adding some prior knowledge into the architecture is by minimizing the effective number of parameters using weight sharing, in which a single weight is shared among many connections in the network (Waibel et al. 1989; Le Cun et al. 1989). An extension of this idea is the "soft weight sharing," which favors irregularities in the weight distribution in the form of multimodality (Nowlan and Hinton 1992). This penalty
Figure 1: A hybrid EPP/PPR neural network (EPPNN).
improved generalization results obtained by a weight elimination penalty. Both these methods make an explicit assumption about the structure of the weight space, but with no regard to the structure of the input space. As described in the context of projection pursuit regression, a penalty term may be added to the energy functional minimized by error backpropagation, for the purpose of measuring directly the goodness of the projections sought by the network. Since our main interest is in reducing overfitting for high-dimensional problems, our underlying assumption is that the surface function to be estimated can be faithfully represented using a low-dimensional composition of sigmoidal functions, namely, using a backpropagation network in which the number of hidden units is much smaller than the number of input units. Therefore, the penalty term may be added only to the hidden layer (see Fig. 1). The synaptic modification equations of the hidden units' weights become

\frac{\partial w_{ij}}{\partial t} = -\epsilon \left[ \frac{\partial E(w, x)}{\partial w_{ij}} + \frac{\partial \rho(w_1, \ldots, w_l)}{\partial w_{ij}} \right] + (\text{contribution of cost/complexity terms})
An approach of this type has been used in image compression, with a penalty aimed at minimizing the entropy of the projected distribution (Bichsel and Seitz 1989). This penalty certainly measures deviation from normality, since entropy is maximized for a gaussian distribution.
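To make the extra penalty gradient concrete, here is a toy single-projection example with a projection index of my own choosing, ρ(a) = −E[(aᵀx)³]², which rewards skewed (non-gaussian) projections; a full hybrid network would simply add this gradient to the backpropagation update of each hidden unit's weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))
X[:, 0] = np.abs(X[:, 0])        # give direction 0 a skewed (non-gaussian) projection

def rho_and_grad(a, X):
    """Illustrative projection index rho(a) = -E[(a.x)^3]^2 and its
    analytic gradient with respect to the projection direction a."""
    t = X @ a
    m3 = np.mean(t**3)
    grad_m3 = 3 * np.mean((t**2)[:, None] * X, axis=0)
    return -m3**2, -2 * m3 * grad_m3

a = np.ones(5) / np.sqrt(5)      # symmetric starting direction
for _ in range(200):
    _, g = rho_and_grad(a, X)
    a -= 0.1 * g                 # the penalty's contribution to the weight update
    a /= np.linalg.norm(a)       # keep the projection direction normalized
print(a)
```

Descending ρ alone pulls the direction toward the skewed first coordinate; in the hybrid network this term is summed with the supervised error gradient rather than used by itself.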
5 Projection Index for Classification: The Unsupervised BCM Neuron
Intrator (1990) has recently shown that a variant of the Bienenstock, Cooper, and Munro neuron (BCM) (Bienenstock et al. 1982) performs exploratory projection pursuit using a projection index that measures multimodality. This neuron version allows theoretical analysis of some visual deprivation experiments (Intrator and Cooper 1992), and is in agreement with the vast experimental results on visual cortical plasticity (Clothiaux et al. 1991). A network implementation that can find several projections in parallel while retaining its computational efficiency was found to be applicable for extracting features from very high-dimensional vector spaces (Intrator and Gold 1992; Intrator et al. 1991; Intrator 1992). The activity of neuron k in the network is c_k = \sum_i x_i w_{ik} + w_{0k}. The inhibited activity and threshold of the kth neuron are given by

\tilde{c}_k = c_k - \eta \sum_{j \neq k} c_j, \qquad \Theta_k = E[\tilde{c}_k^2]

The threshold Θ_k is the point at which the modification function φ changes sign (see Intrator and Cooper 1992 for further details). The function φ is given by

\phi(\tilde{c}, \Theta) = \tilde{c}(\tilde{c} - \Theta)

The risk (projection index) for a single neuron is given by

R_k = -\frac{1}{3} E[\tilde{c}_k^3] + \frac{1}{4} E^2[\tilde{c}_k^2]

The total risk is the sum of each local risk. The negative gradient of the risk, which leads to the synaptic modification equations, is given by

\frac{\partial w_{ik}}{\partial t} = \mu \left( E[\phi(\tilde{c}_k, \Theta_k)\, x_i] - \eta \sum_{j \neq k} E[\phi(\tilde{c}_j, \Theta_j)\, x_i] \right)

This last equation is an additional penalty to the energy minimization of the supervised network. Note that there is an interaction between adjacent neurons in the hidden layer. In practice, the stochastic version of the differential equation can be used as the learning rule.
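A single-neuron sketch of such a stochastic rule, using the quadratic modification function φ(c, Θ) = c(c − Θ) with a running-average threshold (the constants, the synthetic data, and the omission of the lateral-inhibition term are my illustrative simplifications, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
w = 0.1 * rng.standard_normal(2)
# Two clusters: one at the origin, one at (4, 0.4); the discriminating
# (bimodal) projection lies essentially along the first input dimension.
A = rng.normal(0.0, 0.3, (500, 2))
B = rng.normal(0.0, 0.3, (500, 2)) + np.array([4.0, 0.4])
data = np.vstack([A, B])[rng.permutation(1000)]

theta = 0.0
mu, tau = 0.005, 0.1                 # learning rate and threshold-averaging rate
for x in data:
    c = w @ x                        # activity c = sum_i x_i w_i
    theta += tau * (c**2 - theta)    # sliding threshold, roughly E[c^2]
    w += mu * c * (c - theta) * x    # stochastic update with phi(c, theta) = c(c - theta)

print(w)
```

The sliding threshold tracks E[c²], so the weights stabilize once the responding cluster's activity reaches the threshold; the neuron ends up responding to one cluster and not the other, that is, to a bimodal projection, which is exactly what the projection index rewards.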
5.1 Some Related Statistical and Computational Issues of This Projection Index. This section discusses some commonly asked questions regarding the connection of the above projection index to previous work in pattern recognition and statistics. Although the projection index is motivated by the desire to search for clusters in the high-dimensional data, the resulting feature extraction
method is quite different from other pattern recognition methods that search for clusters. Since the class labels are not used in the search, the projection pursuit is not biased to the class labels. This is in contrast with classical methods such as discriminant analysis (Fisher 1936; Sebestyen 1962, and numerous recent publications). The projection index concentrates on projections that allow discrimination between clusters, and not on faithful representation of the data. This is in contrast to principal components analysis, or factor analysis, which tend to combine features that have high correlation (see review in Harman 1967). The method differs from cluster analysis by the fact that it searches for clusters in the low-dimensional projection space, thus avoiding the inherent sparsity of the high-dimensional space. The projection index uses low-order polynomial moments, which are computationally efficient, yet it does not suffer from the main drawback of polynomial moments, namely sensitivity to outliers. It naturally extends to multidimensional projection pursuit using the feedforward inhibition network. The number of calculations of the gradient grows linearly with the dimensionality and linearly with the number of projections sought.

6 Applications
We have applied this hybrid classification method to various speech and image recognition problems in high-dimensional space. In one speech application we used voiceless stop consonants extracted from the TIMIT database as training tokens (Intrator and Tajchman 1991). A detailed biologically motivated speech representation was produced by Lyon's cochlear model (Lyon 1982; Slaney 1988). This representation produced 5040 dimensions (84 channels × 60 time slices). In addition to an initial voiceless stop, each token contained a final vowel from the set [aa, ao, er, iy]. Classification of the voiceless stop consonants using a test set that included 7 vowels [uh, ih, eh, ae, ah, uw, ow] produced an average error of 18.8%, while on the same task classification using a backpropagation network produced an average error of 20.9% (a significant difference, p < 0.0013). Additional experiments on vowel tokens appear in Tajchman and Intrator (1992). Another application is in the area of face recognition from gray level pixels (Intrator et al. 1992). After aligning and normalizing the images, the input was set to 37 × 62 pixels (a total of 2294 dimensions). The recognition performance was tested on a subset of the MIT Media Lab database of face images made available by Turk and Pentland (1991), which contained 27 face images of each of 16 different persons. The images were taken under varying illumination and camera location. Of the 27 images available, 17 randomly chosen ones served for training and the remaining 10 were used for testing. Using an ensemble average of hybrid networks (Lincoln and Skrzypek 1990; Pearlmutter and Rosenfeld 1991; Perrone
Combining EPP and PPR
451
and Cooper 1992) we obtained an error rate of 0.62%, as opposed to 1.2% using a similar ensemble of backpropagation networks. A single backpropagation network achieves an error of between 2.5 and 6% on these data. The experiments were done using 8 hidden units.
7 Summary
A penalty that allows the incorporation of additional prior information on the underlying model was presented. This prior was introduced in the context of projection pursuit regression and classification, and in the context of a backpropagation network. It achieves partial decoupling of the estimation of the ridge functions (in PPR), or of the regression function in a backpropagation net, from the estimation of the projections. Thus it is potentially useful in reducing problems associated with overfitting, which are more pronounced in high-dimensional data. Some possible projection indices were discussed, and a specific projection index that is particularly useful for classification was presented in this context. This measure, which emphasizes multimodality in the projected distribution, was found useful in several very high-dimensional problems.
Acknowledgments
I wish to thank Leon Cooper, Stu Geman, and Michael Perrone for many fruitful conversations, and the referee for helpful comments. The speech experiments were performed using the computational facilities of the Cognitive Science Department at Brown University. Research was supported by the National Science Foundation, the Army Research Office, and the Office of Naval Research.
References
Barron, A. R. 1989. Statistical properties of artificial neural networks. In Proc. IEEE Conf. on Decision and Control, pp. 280-285. IEEE Press, New York. Barron, A. R., and Barron, R. L. 1988. Statistical learning networks: A unifying view. In Computing Science and Statistics: Proc. 20th Symp. Interface, E. Wegman, ed., pp. 192-203. American Statistical Association, Washington, DC. Bichsel, M., and Seitz, P. 1989. Minimum class entropy: A maximum information approach to layered networks. Neural Networks 2, 133-141. Bienenstock, E. L., Cooper, L. N., and Munro, P. W. 1982. Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32-48. Bridle, J. S., and MacKay, D. J. C. 1992. Unsupervised classifiers, mutual information and 'Phantom Targets'. In Advances in Neural Information Processing
452
Nathan Intrator
Systems, Vol. 4, J. Moody, S. Hanson, and R. Lippmann, eds., pp. 1096-1101. Morgan Kaufmann, San Mateo, CA. Carroll, S. M., and Dickinson, B. W. 1989. Construction of neural net using the radon transform. In International Joint Conference on Neural Networks, Vol. 1, pp. 607-611. IEEE Press, New York. Clothiaux, E. E., Cooper, L. N., and Bear, M. F. 1991. Synaptic plasticity in visual cortex: Comparison of theory with experiment. Journal of Neurophysiology 66, 1785-1804. Cybenko, G. 1989. Approximations by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2, 303-314. Diaconis, P., and Freedman, D. 1984. Asymptotics of graphical projection pursuit. Ann. Statist. 12, 793-815. Fisher, R. A. 1936. The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179-188. Friedman, J. H. 1987. Exploratory projection pursuit. J. Am. Statist. Assoc. 82, 249-266. Friedman, J. H., and Stuetzle, W. 1981. Projection pursuit regression. J. Am. Statist. Assoc. 76, 817-823. Friedman, J. H., and Tukey, J. W. 1974. A projection pursuit algorithm for exploratory data analysis. IEEE Transact. Computers C-23, 881-889. Funahashi, K. 1989. On the approximate realization of continuous mappings by neural networks. Neural Networks 2, 183-192. Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias-variance dilemma. Neural Comp. 4, 1-58. Gutfinger, D., and Sklansky, J. 1991. Robust classifiers by mixed adaptation. IEEE Transact. Pattern Anal. Machine Intelligence 13, 552-567. Hall, P. 1988. Estimating the direction in which a data set is most interesting. Probab. Theory Rel. Fields 80, 51-78. Hall, P. 1989a. On polynomial-based projection indices for exploratory projection pursuit. Ann. Statist. 17, 589-605. Hall, P. 1989b. On projection pursuit regression. Ann. Statist. 17, 573-588. Harman, H. H. 1967. Modern Factor Analysis, 2nd ed. University of Chicago Press, Chicago. Hecht-Nielsen, R. 1989.
Theory of the backpropagation neural network. In International Joint Conference on Neural Networks, Vol. 1, pp. 593-606. IEEE Press, New York. Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural Networks 4, 251-257. Hornik, K., Stinchcombe, M., and White, H. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2, 359-366. Hornik, K., Stinchcombe, M., and White, H. 1990. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks 3, 551-560. Huber, P. J. 1985. Projection pursuit (with discussion). Ann. Statist. 13, 435-475. Intrator, N. 1990. Feature extraction using an unsupervised neural network. In Proceedings of the 1990 Connectionist Models Summer School, D. S. Touretzky,
J. L. Ellman, T. J. Sejnowski, and G. E. Hinton, eds., pp. 310-318. Morgan Kaufmann, San Mateo, CA. Intrator, N. 1992. Feature extraction using an unsupervised neural network. Neural Comp. 4, 98-107. Intrator, N., and Cooper, L. N. 1992. Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks 5, 3-17. Intrator, N., and Gold, J. I. 1992. Three-dimensional object recognition of gray level images: The usefulness of distinguishing features. Neural Comp. 5, 61-74. Intrator, N., and Tajchman, G. 1991. Supervised and unsupervised feature extraction from a cochlear model for speech recognition. In Neural Networks for Signal Processing: Proceedings of the 1991 IEEE Workshop, B. H. Juang, S. Y. Kung, and C. A. Kamm, eds., pp. 460-469. IEEE Press, New York. Intrator, N., Gold, J. I., Bülthoff, H. H., and Edelman, S. 1991. Three-dimensional object recognition using an unsupervised neural network: Understanding the distinguishing features. In Proceedings of the 8th Israeli Conference on AICV, Y. Feldman and A. Bruckstein, eds., pp. 113-123. Elsevier, Amsterdam. Intrator, N., Reisfeld, D., and Yeshurun, Y. 1992. Face recognition using a hybrid supervised/unsupervised neural network. Preprint. Ito, Y. 1991. Representation of functions by superpositions of a step or sigmoid function and their applications to neural network theory. Neural Networks 4, 385-394. Jones, M. C., and Sibson, R. 1987. What is projection pursuit? (with discussion). J. R. Statist. Soc. Ser. A 150, 1-36. Kruskal, J. B. 1969. Toward a practical method which helps uncover the structure of the set of multivariate observations by finding the linear transformation which optimizes a new 'index of condensation'. In Statistical Computation, R. C. Milton and J. A. Nelder, eds. Academic Press, New York. Kruskal, J. B. 1972. Linear transformation of multivariate data to reveal clustering.
In Multidimensional Scaling: Theory and Application in the Behavioral Sciences, I, Theory, R. N. Shepard, A. K. Romney, and S. B. Nerlove, eds., pp. 179-191. Seminar Press, New York. Le Cun, Y. 1985. Une procédure d'apprentissage pour réseau à seuil asymétrique. In Cognitiva 85: À la Frontière de l'Intelligence Artificielle des Sciences de la Connaissance des Neurosciences, pp. 599-604. CESTA, Paris. Le Cun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., and Jackel, L. 1989. Backpropagation applied to handwritten zip code recognition. Neural Comp. 1, 541-551. Le Cun, Y., Denker, J., and Solla, S. 1990. Optimal brain damage. In Advances in Neural Information Processing Systems, Vol. 2, D. Touretzky, ed., pp. 598-605. Morgan Kaufmann, San Mateo, CA. Lincoln, W. P., and Skrzypek, J. 1990. Synergy of clustering multiple backpropagation networks. In Advances in Neural Information Processing Systems, Vol. 2, D. S. Touretzky and R. P. Lippmann, eds., pp. 650-657. Morgan Kaufmann, San Mateo, CA.
Lippmann, R. P. 1989. Review of neural networks for speech recognition. Neural Comp. 1(1), 1-38. Lyon, R. F. 1982. A computational model of filtering, detection, and compression in the cochlea. In Proceedings IEEE International Conference on Acoustics, Speech, and Signal Processing, Paris, France. Moody, J. E. 1991. Note on generalization, regularization and architecture selection in nonlinear learning systems. In Neural Networks for Signal Processing: Proceedings of the 1991 IEEE Workshop, B. H. Juang, S. Y. Kung, and C. A. Kamm, eds., pp. 1-10. Mougeot, M., Azencott, R., and Angeniol, B. 1991. Image compression with back propagation: Improvement of the visual restoration using different cost functions. Neural Networks 4, 467-476. Mozer, M. C., and Smolensky, P. 1989. Using relevance to reduce network size automatically. Connection Sci. 1(1), 3-16. Nowlan, S. J., and Hinton, G. E. 1992. Simplifying neural networks by soft weight-sharing. Neural Comp. 4, 473-493. Pearlmutter, B. A., and Rosenfeld, R. 1991. Chaitin-Kolmogorov complexity and generalization in neural networks. In Advances in Neural Information Processing Systems, Vol. 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 925-931. Morgan Kaufmann, San Mateo, CA. Perrone, M. P., and Cooper, L. N. 1992. Improving network performance: Using averaging to construct hybrid networks. Proceedings of the CAIP Conference, Rutgers University, October. Plaut, D. C., Nowlan, S. J., and Hinton, G. E. 1986. Experiments on learning by back-propagation. Tech. Rep. CMU-CS-86-126, Carnegie-Mellon University. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1, D. E. Rumelhart and J. L. McClelland, eds., pp. 318-362. MIT Press, Cambridge, MA. Sebestyen, G. 1962. Decision Making Processes in Pattern Recognition. Macmillan, New York. Slaney, M. 1988. Lyon's cochlear model. Tech.
Rep., Apple Corporate Library, Cupertino, CA 95014. Switzer, P. 1970. Numerical classification. In Geostatistics, V. Barnett, ed. Plenum Press, New York. Tajchman, G. N., and Intrator, N. 1992. Phonetic classification of TIMIT segments preprocessed with Lyon's cochlear model using a supervised/unsupervised hybrid neural network. In Proceedings International Conference on Spoken Language Processing, Banff, Alberta, Canada. Turk, M., and Pentland, A. 1991. Eigenfaces for recognition. J. Cog. Neurosci. 3, 71-86. Wahba, G. 1990. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59. SIAM, Philadelphia. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. 1989. Phoneme recognition using time-delay neural networks. IEEE Transact. ASSP 37, 328-339. Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. 1991. Generalization
by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems, Vol. 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 875-882. Morgan Kaufmann, San Mateo, CA. Werbos, P. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. dissertation, Harvard University. White, H. 1990. Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks 3, 535-549. Yamac, M. 1969. Can we do better by combining 'supervised' and 'nonsupervised' machine learning for pattern analysis. Ph.D. dissertation, Brown University. Received 26 June 1992; accepted 26 October 1992.
Communicated by Steven J. Nowlan
A Simplified Gradient Algorithm for IIR Synapse Multilayer Perceptrons Andrew D. Back Ah Chung Tsoi Department of Electrical Engineering, University of Queensland, St. Lucia 4072, Australia
A network architecture with a global feedforward local recurrent construction was presented recently as a new means of modeling nonlinear dynamic time series (Back and Tsoi 1991a). The training rule used was based on minimizing the least mean square (LMS) error and performed well, although the amount of memory required for large networks may become significant if a large number of feedback connections are used. In this note, a modified training algorithm based on a technique for linear filters is presented, simplifying the gradient calculations significantly. The memory requirements are reduced from O[n_a(n_a + n_b)N_s] to O[(2n_a + n_b)N_s], where n_a is the number of feedback delays, n_b is the number of feedforward delays, and N_s is the total number of synapses. The new algorithm reduces the number of multiply-adds needed to train each synapse by n_a at each time step. Simulations indicate that the algorithm has almost identical performance to the previous one.
1 Introduction
A solution to the problem of modeling nonlinear dynamic systems was proposed recently with the introduction of a global feedforward local recurrent network architecture (Back and Tsoi 1991a). This network is based on a multilayer perceptron structure, with synapses that have an infinite impulse response (IIR). Simulations have shown that the network is capable of better performance than a network with only finite impulse response (FIR) synapses. For large networks of this class, a significant amount of memory storage may be required for learning, due to the filtering required for the gradient term of each weight. The memory requirement for the network is O[n_a(n_a + n_b)N_s], where n_a is the number of feedback delays, n_b is the number of feedforward delays (it is assumed that the order of every synapse is the same), and N_s is the total number of synapses.
Neural Computation 5, 456-462 (1993) © 1993 Massachusetts Institute of Technology
One method of overcoming the memory and computational requirements of the network was presented by Back and Tsoi (1991b). In that case, a reduced-complexity network was proposed that simplified the structure of the network for particular classes of problems. In this note, a modified gradient algorithm is presented that simplifies the learning algorithm while retaining the same network architecture. The algorithm is presented in Section 2. Simulation results are given in Section 3, which indicate that the performance of the algorithm is almost identical to that of the training rule presented in Back and Tsoi (1991a).
2 A Simplified Learning Algorithm
A description of the multilayer perceptron with global feedforward local recurrent structure is presented again here for convenience. Let the network have L + 1 layers (l = 0, 1, ..., L). Each layer has N_l neurons with outputs z_i^l(t) (i = 1, 2, ..., N_l), where l = 0 is the input layer and l = L is the output layer. An MLP with IIR synapses is defined by

z_k^{l+1}(t) = f(s_k^{l+1}(t))   (2.1)

s_k^{l+1}(t) = Σ_{i=1}^{N_l} y_{ik}^l(t) + z_{k0}^l   (2.2)

y_{ik}^l(t) = [B_{ik}^l(q^{-1}) / A_{ik}^l(q^{-1})] z_i^l(t)   (2.3)

where

A_{ik}^l(q^{-1}) = 1 + Σ_{j=1}^{n_a} a_{ikj}^l q^{-j}   (2.4)

B_{ik}^l(q^{-1}) = Σ_{j=0}^{n_b} b_{ikj}^l q^{-j}   (2.5)

f(x) = 1/(1 + e^{-x})   (2.6)

and k = 1, 2, ..., N_{l+1}, where l + 1 = L denotes the output layer, q^{-j} x(t) = x(t - j), n_a and n_b are, respectively, the numbers of delayed feedback and feedforward inputs to a neuron, and z_{k0}^l is the bias. The polynomials B_{ik}^l(q^{-1}) and A_{ik}^l(q^{-1}) are relatively prime, with n_a ≥ n_b. In the previous learning algorithm, the weights are adjusted by minimizing a performance criterion defined by

J(t) = (1/2) Σ_{k=1}^{N_L} e_k²(t)   (2.7)
where

e_k(t) = y_k(t) - z_k^L(t)   (2.8)

and y_k(t) is the desired output at time t. The weights are adjusted according to

a_{ikj}^l(t+1) = a_{ikj}^l(t) - η_a ∂J(t)/∂a_{ikj}^l(t)   (2.9)

b_{ikj}^l(t+1) = b_{ikj}^l(t) - η_b ∂J(t)/∂b_{ikj}^l(t)   (2.10)

where each gradient factors through the synaptic output,

∂J(t)/∂b_{ikj}^l(t) = δ_k^{l+1}(t) ∂y_{ik}^l(t)/∂b_{ikj}^l(t)   (2.11)

with δ_k^{l+1}(t) the error backpropagated to neuron k of layer l + 1. Consider the calculation of the sensitivity components ∂y_{ik}^l(t)/∂a_{ikj}^l(t) and ∂y_{ik}^l(t)/∂b_{ikj}^l(t) from 2.9 and 2.10 (Back and Tsoi 1991a). We have

∂y_{ik}^l(t)/∂b_{ikj}^l(t) = [1/A_{ik}^l(q^{-1})] z_i^l(t - j)   (2.12)

∂y_{ik}^l(t)/∂a_{ikj}^l(t) = -[1/A_{ik}^l(q^{-1})] y_{ik}^l(t - j)   (2.13)

where the synaptic output y(t) is given by 2.3. In the linear adaptive filtering context, Hsia (1981) noted that a simplified version of this form of gradient filtering operation can be obtained by carrying out the filtering (via the 1/A(q^{-1}) autoregressive filter) first, and then delaying the output. This led to a reduced number of operations. In the current context, it is possible to employ the same technique in the gradient computations. This results in only two filtering operations per synapse, as opposed to O(n_a + n_b). This can be reduced further by performing the forward filtering operation B(q^{-1})/A(q^{-1}) in two stages. This results in only one gradient filtering operation per synapse (for each learning step). Rewriting 2.3, we have

v_{ik}^l(t) = [1/A_{ik}^l(q^{-1})] z_i^l(t)   (2.14)

y_{ik}^l(t) = B_{ik}^l(q^{-1}) v_{ik}^l(t)   (2.15)

Hence the new values for ∂y_{ik}^l(t)/∂a_{ikj}^l(t) and ∂y_{ik}^l(t)/∂b_{ikj}^l(t) can be found by delaying the terms in 2.14 and 2.15. Thus,

∂y_{ik}^l(t)/∂b_{ikj}^l(t) = q^{-j} v_{ik}^l(t)   (2.16)

∂y_{ik}^l(t)/∂a_{ikj}^l(t) = -q^{-j} [1/A_{ik}^l(q^{-1})] y_{ik}^l(t)   (2.17)

Note that the filtering performed in 2.16 is done only for j = 0. The difference between the new regressor equations 2.16, 2.17 and the original formulation in 2.12, 2.13 is that in the simplified case, separate filters are not maintained for each regressor. Filtering is performed only at the output of the synapse (corresponding to j = 0), and that filtered regressor is then subjected to a delay (q^{-j}) to obtain the sensitivities for j = 1, 2, ..., n_a. Note that v_{ik}^l(t) = z_i^l(t)/A_{ik}^l(q^{-1}) is obtained in the forward pass. Equations 2.9, 2.10, 2.11, 2.16, and 2.17 form the new weight update equations. Previously, the memory requirement to train a synapse was O[n_a(n_a + n_b)]. With the new algorithm, this is reduced to O[(n_a + n_b)] + O(n_a) = O[(2n_a + n_b)]. The memory requirement for the network is O[(2n_a + n_b)N_s]. The number of multiply-adds required during learning for each synaptic weight is reduced by n_a operations at each time step. In the next section, simulation results are discussed and conclusions are drawn concerning the simplified gradient algorithm.
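The simplification can be checked numerically in a few lines. The sketch below (our code; function names are ours) compares, for the b-coefficient sensitivities of a single fixed IIR synapse, the original scheme of equation 2.12 (one 1/A(q^{-1}) filtering per regressor) with the simplified scheme of equation 2.16 (filter once, then delay). For a time-invariant A(q^{-1}) the two coincide exactly; as the text notes, they differ only transiently when the A weights change between time steps.

```python
import numpy as np

def iir_filter(b, a, x):
    """Direct-form IIR filter with a[0] = 1 and zero initial conditions."""
    y = np.zeros_like(x)
    for t in range(len(x)):
        acc = sum(b[j] * x[t - j] for j in range(len(b)) if t - j >= 0)
        acc -= sum(a[j] * y[t - j] for j in range(1, len(a)) if t - j >= 0)
        y[t] = acc
    return y

def delay(x, j):
    """q^-j x(t) with zero initial conditions."""
    out = np.zeros_like(x)
    out[j:] = x[: len(x) - j] if j > 0 else x
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(200)          # synaptic input z_i(t)
a = np.array([1.0, -0.5, 0.25])       # A(q^-1), n_a = 2 (stable, held fixed)

# Original scheme (2.12): one filtering operation per regressor j,
#   d y(t) / d b_j = [1/A(q^-1)] x(t - j)
grads_orig = [iir_filter([1.0], a, delay(x, j)) for j in range(3)]

# Simplified scheme (2.16), after Hsia (1981): filter once, then delay,
#   d y(t) / d b_j = q^-j v(t),  v(t) = [1/A(q^-1)] x(t)
v = iir_filter([1.0], a, x)
grads_new = [delay(v, j) for j in range(3)]

max_diff = max(float(np.max(np.abs(g0 - g1)))
               for g0, g1 in zip(grads_orig, grads_new))
```

Because filtering by a fixed linear A(q^{-1}) commutes with the delay q^{-j} (both with zero initial conditions), `max_diff` is zero, which is exactly why one regressor filter per synapse suffices.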
3 Discussion
To assess the performance of the new algorithm, simulations were carried out on a nonlinear system identification task. In the example considered here, the nonlinear system to be modeled is described by

y(t) = sin{ [(0.0154 + 0.0462q^{-1} + 0.0462q^{-2} + 0.0154q^{-3}) / (1 - 1.99q^{-1} + 1.572q^{-2} - 0.4583q^{-3})] x(t) }   (3.1)
where x(t) is a white noise input. An IIR MLP of 2 layers, 10 hidden units, and synaptic orders of (n_a, n_b) = (7, 6) (hidden layer) and (n_a, n_b) = (0, 0) (output layer) was used to model the system. The network was trained using identical learning rates and initial weights for each algorithm. After convergence, each system was tested with a white gaussian input. It was observed that the mean square errors in each case were very close to one another. The simplified gradient algorithm gave slightly better results,
with an mse of 0.0013, as compared to 0.0018 for the previous algorithm (averaged over 200 data points after 250 × 10³ iterations). The mean square error performance of each algorithm is shown in Figure 1, where it is evident that the algorithms have very similar learning characteristics.

Figure 1: Mean square error performance of the initial IIR MLP algorithm and the simplified version.

It is noted in Hsia (1981) that the transient performance of the algorithms differs, due to the difference in the A(q^{-1}) weights between time t and t - 1. This is observed in the IIR MLP also, as indicated in Figure 2. The final values of the corresponding weights learned by each algorithm are similar, but not identical. This is expected considering the differences in 2.12 and 2.13 compared with 2.16 and 2.17. More important than the differences in weights are the final pole-zero positions of the synapses under the simplified algorithm. To ensure that the dynamic performance of each corresponding synapse is the same when using the simplified algorithm in place of the initial version, the pole-zero positions should coincide for the respective synapses. In the experiments performed, it is observed that this is indeed the case.
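The benchmark system of equation 3.1, a third-order linear IIR filter followed by a sin nonlinearity and driven by white noise, can be sketched as follows; the filter routine and variable names are ours, not from the original paper, with the coefficients taken from the equation above.

```python
import numpy as np

# Nonlinear benchmark of equation 3.1: w(t) = [B(q^-1)/A(q^-1)] x(t),
# then y(t) = sin(w(t)), with the coefficients quoted in the text.
b = [0.0154, 0.0462, 0.0462, 0.0154]
a = [1.0, -1.99, 1.572, -0.4583]

def system_output(x):
    w = np.zeros(len(x))
    for t in range(len(x)):
        acc = sum(b[j] * x[t - j] for j in range(len(b)) if t - j >= 0)
        acc -= sum(a[j] * w[t - j] for j in range(1, len(a)) if t - j >= 0)
        w[t] = acc
    return np.sin(w)

rng = np.random.default_rng(1)
x = rng.standard_normal(2000)   # white noise input
y = system_output(x)
```

The linear part is a stable unity-DC-gain lowpass, so the sin() nonlinearity operates on a bounded signal; the resulting target is bounded in [-1, 1].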
Figure 2: Parameter convergence of the initial IIR MLP algorithm and the simplified version.

In conclusion, this technique offers substantial savings in memory and computation during learning, yet gives equivalent performance.
Acknowledgments The first author acknowledges support through a Research Fellowship with the Electronics Research Laboratory, DSTO, Australia. The second author acknowledges partial support from the Australian Research Council.
References
Back, A. D., and Tsoi, A. C. 1991a. FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Comp. 3(3), 375-385. Back, A. D., and Tsoi, A. C. 1991b. Analysis of hidden layer weights in a dynamic locally recurrent network. In Artificial Neural Networks, T. Kohonen,
K. Mäkisara, O. Simula, and J. Kangas, eds., pp. 961-966. Elsevier Science Publishers B.V., North-Holland. Hsia, T. C. 1981. A simplified adaptive recursive filter design. Proc. IEEE 69(9), 1153-1155. White, S. A. 1975. An adaptive recursive digital filter. Proc. Ninth Asilomar Conf. Circuits, Systems and Computers, Pacific Grove, CA, pp. 21-25. Received 29 October 1991; accepted 19 August 1992.
Communicated by Shun-ichi Amari
The Characteristics of the Convergence Time of Associative Neural Networks Toshiaki Tanaka Miki Yamada Advanced Research Laboratory, Toshiba R&D Center, 1 Komukai Toshiba-cho, Saiwai-ku, Kawasaki 210, Japan
The authors have analyzed the dynamics of associative neural networks based on macroscopic state equations and have shown that both a layered associative net and an autocorrelation-type net have the same convergence property: If a recalling process succeeds, the network converges very fast to one of the memorized patterns. But if a recalling process fails, it converges very slowly to a spurious state or does not converge. This property was also checked by computer simulations on a large-scale (N = 1000) neural network. Moreover, it is shown that the convergence time for a successful recall is of order log(N). If this convergence time difference is used, execution time and memory can be saved, and it can be determined whether a recalling process succeeds or fails without any additional procedure.
1 Introduction
The associative memory was proposed by Kohonen and Nakano independently (Kohonen 1972; Nakano 1972). Its dynamic behaviors have been well studied by many researchers (Amari 1972; Amari and Maginu 1988; Amit 1989; Cottrell 1988; Domany et al. 1989; Hopfield 1982; Meir and Domany 1987). The recalling process of associative networks is divided into two groups, successful recall and failed recall. If the network converges to one of the memorized patterns or near them, this process is called successful recall. If the network is trapped in a spurious state as an equilibrium or does not converge to a fixed point, it is called failed recall. From the information processing point of view, the distinction between successful recall and failed recall is required, especially in the case of hierarchical associative networks, because the generated meaningless output of the first layer becomes the input pattern of the second layer and meaningless activations spread. The distinction between successful recall and failed recall is equivalent to the distinction between a memorized pattern and a spurious state
Neural Computation 5, 463-472 (1993) © 1993 Massachusetts Institute of Technology
as an equilibrium. So it is an interesting problem in the dynamics of associative networks. Several ideas for such a distinction have been suggested. Amit (1989) plotted the average final overlaps and average retrieval times vs. the distance of the stimulus from a memorized pattern. He discussed the relation between the basins of attraction and retrieval times. Parisi (1986) proposed asymmetric neural nets, in which only the time-independent outputs can be considered as meaningful outputs. Gutfreund (1988) introduced these ideas and discussed the importance of the problem for a hierarchical associative network. The convergence time has been estimated in the case of successful recalls (Komlós and Paturi 1988) and in worst cases (Floréen 1991). But this problem has not been well analyzed. This paper deals with this problem on both a layered associative net and an autocorrelation-type net. Macroscopic state equations that describe the dynamic behavior of these two types of nets have been obtained by Meir and Domany (1987) and Amari and Maginu (1988), respectively. Based on the former equation, the authors have analyzed the convergence time of the associative network in detail and have shown that some characteristics of the convergence time are useful for making the distinction between successful recall and failed recall.
2 Macroscopic State Equations of Layered Associative Net
A layered associative net and its macroscopic state equation will be introduced in this section. The state S_i^l of a neuron i on a layer l (i = 1, ..., N) (l = 1, ..., L) takes binary values ±1, where N is the number of neurons in each layer and L is the number of layers. The state S_i^{l+1} is determined by the following law

S_i^{l+1} = sgn(h_i^{l+1})   (2.1)

where

sgn(x) = 1 if x ≥ 0, and -1 if x < 0

Here, J_{ij}^l is the connection weight value from neuron j on layer l to neuron i on layer l + 1, and h_i^{l+1} = Σ_j J_{ij}^l S_j^l is the total input to neuron i on layer l + 1. The connection weight is defined as follows:

J_{ij}^l = (1/N) Σ_{ν=1}^{K} ξ_i^{ν,l+1} ξ_j^{ν,l}   (2.2)

where ξ^{ν,l} = (ξ_i^{ν,l}) (ν = 1, ..., K) is an N-dimensional memory vector defined in each layer l. Each component in a memory vector takes binary values ±1 randomly with equal probability 1/2. So memory vectors are orthogonal in a stochastic sense. Let m^l be the average direction cosine between the state S_i^l and the memory pattern ξ_i^{1,l} on layer l, defined by

m^l = (1/N) Σ_{i=1}^{N} ξ_i^{1,l} S_i^l   (2.3)
The dynamics of m^l can be obtained (Meir and Domany 1987; Domany et al. 1989) as follows:

m^{l+1} = erf( m^l / (√2 Δ^l) )   (2.4)

(Δ^{l+1})² = α + (2/π) exp[ -(m^l/Δ^l)² ]   (2.5)

where

(Δ¹)² = α = K/N,  Δ^l ≥ 0  (K, N → ∞)
Equations 2.4 and 2.5 are called the macroscopic state equations. The main results obtained by Meir and Domany (1987) will be listed.
- Equations 2.4 and 2.5 have three fixed points (two are stable and one is unstable) if α < α_c (= 0.27), and one stable fixed point if α > α_c.
- m_c(α) exists so that m^l converges to m* ≈ 1 if m¹ > m_c(α) is satisfied, and m^l converges to m* = 0 otherwise.
- m_c(α) is a monotonically increasing function of α (< α_c) and tends to infinity if α > α_c.
- For α << 1, the upper stable fixed point m* has the form

m* ≈ 1 - √(2α/π) exp(-1/2α)

This equation is an exact solution for a layered associative net. But it does not hold in an autocorrelation-type net. The approximated macroscopic state equations for this type of net were obtained by Amari and Maginu (1988).
Figure 1: Schematic graph of f(x) (α < α_c). The arrows show the time development of the network dynamics.
3 Analysis of the Macroscopic State Equations

The macroscopic state equations 2.4 and 2.5 will be analyzed in detail in this section. Let m̃^{l+1} be

m̃^{l+1} = m^{l+1} / Δ^{l+1}

Equations 2.4 and 2.5 are rewritten using this new variable as the following simple equation:

m̃^{l+1} = f(m̃^l)   (3.1)

where

f(x) = erf(x/√2) / √(α + (2/π) exp(-x²))   (3.2)

From equation 3.2, f(x) (x ≥ 0) is an S-shaped function satisfying f(0) = 0 and f(+∞) = 1/√α. Schematic shapes of the function f(x) are shown in Figure 1 for α < α_c. The function f(x) has three fixed points 0, m̃_c, m̃_s for α < α_c (≈ 0.27). It is clear that {m̃^l} is a monotonically decreasing (or increasing) sequence if 0 < m̃^l < m̃_c or m̃_s < m̃^l holds (or m̃_c < m̃^l < m̃_s). This property directly leads to the monotonicity of the sequence {m^l} for l ≥ 2.
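A minimal numerical sketch of this bistability is below; the closed form used for f is our reconstruction (an assumption), chosen only to match the stated properties f(0) = 0 and f(+∞) = 1/√α.

```python
import math

# One-dimensional map of equations 3.1-3.2.  The closed form of f used
# here is an assumption: it is reconstructed to satisfy f(0) = 0 and
# f(+inf) = 1/sqrt(alpha), as stated in the text.
def f(x, alpha):
    return math.erf(x / math.sqrt(2)) / math.sqrt(
        alpha + (2 / math.pi) * math.exp(-x * x))

def iterate(x, alpha, steps=500):
    for _ in range(steps):
        x = f(x, alpha)
    return x

alpha = 0.1                  # below alpha_c ~= 0.27: three fixed points
high = iterate(3.0, alpha)   # start above the unstable point: recall succeeds
low = iterate(0.1, alpha)    # start below it: recall fails
```

Starting above the middle (unstable) fixed point the iterates settle near the upper fixed point close to 1/√α, while starting below it they decay monotonically to zero.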
Figure 2: Numerical simulation results of L_ε(m¹; α) using the macroscopic state equations. Each value in the graph indicates the storage level α (ε = 0.001).

Let L_ε(m¹; α) be the minimum layer number at which the difference between m^l and an equilibrium m* is less than ε:

L_ε(m¹; α) = min{ l : |m^l - m*| < ε }
The sequence {m^l} is monotonic, and two different sequences {m^l}, {m'^l} do not change the sign of m^l - m'^l if these sequences converge to the same fixed point. So L_ε(m¹; α) is a monotonically increasing function if 0 < m¹ < m_c or m_s < m¹, and a monotonically decreasing one if m_c < m¹ < m_s, where 0, m_c, m_s are the fixed points of equations 2.4 and 2.5 and satisfy m_c = m̃_c Δ_c, m_s = m̃_s Δ_s, and m¹ = m̃¹ √α. The convergence times L_ε(m¹; α) are shown in Figure 2 for some α's (ε = 0.001), obtained by iterating the macroscopic state equations 2.4 and 2.5 until convergence to an equilibrium. These numerical simulations show the monotonicity of L_ε(m¹; α). The increasing curve of L_ε(m¹; α) for m¹ > m_s cannot be seen when α << 1 because m_s ≈ 1. Next, we estimate L_ε(m¹; α) quantitatively. When m̃¹ < m̃_c, m̃^l converges to zero. Since f(m̃) is convex in the range (0, m̃_c), the linear approximation of f(m̃) around m̃ = 0 gives a good approximation for L_ε(m¹; α). In general, the convergence time step T_c at which a difference equation x_{n+1} = p x_n + q (0 < p < 1) satisfies |x_n - x*| < ε is estimated by

T_c = (log ε)/(log p) - (log|x₁ - x*|)/(log p)   (3.3)
where x₁ is the initial value and x* = q/(1 - p) is an equilibrium of the equation. When p > 1, the time needed to move away a distance ε from the equilibrium x* is given by -T_c. Using equation 3.3, the convergence time step for m̃^l is given by

T_c = (log ε)/(log a) - (log m̃¹)/(log a)   (3.4)

where a = f'(0) = 1/√(1 + πα/2). So L_ε(m¹; α) is estimated by

L_ε(m¹; α) = (log ε)/(log a) - (log m¹)/(log a) - (log[1 + 2/(πα)])/(2 log a)   (3.5)

where ε is replaced by ε/√(α + 2/π), since Δ^l converges to √(α + 2/π) when m̃^l becomes zero. For example, when α = 0.1, ε = 0.001, and m¹ = 0.1, the approximated convergence time L_ε(m¹; α) ≈ 77. This approximation shows good agreement with the numerical simulation results of Figure 2. When m̃¹ is near m̃_s, m̃^l converges to m̃_s, and the convergence time

T_c = (log ε)/(log b) - (log|m̃¹ - m̃_s|)/(log b)   (3.6)

is obtained in the same manner, where b = f'(m̃_s). Putting m̃_s ≈ 1/√α, m̃¹ = m¹/√α, and replacing ε by ε/√α, we get

L_ε(m¹; α) = (log ε)/(log b) - (log(1 - m¹))/(log b)   (3.7)
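The worked example in the text (α = 0.1, ε = 0.001, m¹ = 0.1, giving L_ε ≈ 77) can be reproduced from the linearized failed-recall estimate 3.5; the closed form a = f'(0) = 1/√(1 + πα/2) used below is an assumption of this sketch.

```python
import math

# Linearized failed-recall estimate of equation 3.5, with
# a = f'(0) = 1/sqrt(1 + pi*alpha/2) taken as an assumption.
def L_failed(m1, alpha, eps):
    a = 1 / math.sqrt(1 + math.pi * alpha / 2)
    return (math.log(eps) - math.log(m1)
            - 0.5 * math.log(1 + 2 / (math.pi * alpha))) / math.log(a)

L = L_failed(0.1, 0.1, 0.001)   # the text quotes approximately 77 layers
```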
Hereafter, the recalling process will be called successful recall when the network converges to the fixed point m_s, and failed recall when the network converges to zero. The convergence time difference between successful recall (equation 3.7) and failed recall (equation 3.5) mainly depends on the coefficients a and b. Numerical calculations show log b/log a ≈ 10 if α < 0.2. So the convergence times of successful recall are much smaller than those of failed recall. For example, when m¹ = 0.9, α = 0.1, and ε = 0.001, we obtain b = 0.02 and L_ε(m¹; α) ≈ 1. When m̃¹ is near m̃_c, m̃^l converges to m̃_s or zero. The convergence time L_ε(m¹; α) is estimated as the sum of the time needed to move away from m̃_c and the time needed to converge to the equilibrium from near
Convergence Time of Associative Neural Networks
it. The former time can take large values, as in equation 3.5, but the region around m̄¹ ≈ m̄_c where L_c(m¹; α) is large is very small. So the "spike phenomenon" is observed around m¹ = m_c, as shown in Figure 2. The convergence time L_c(m¹; α) depends on the accuracy ε. The problem is how to choose ε to estimate the convergence time of finite-size neural networks. The direction cosine m^t takes N + 1 discrete values 1, (N − 2)/N, (N − 4)/N, ..., −1 when N is finite, so it is sufficient to put ε = 1/N. Then it is easy to see from equations 3.5 and 3.7 that L_c(m¹; α) is of order log(N). Numerical simulations of L_c(m¹; α) based on the approximated macroscopic state equations obtained by Amari and Maginu are very similar to those of the layered associative net, except for the critical value α_c; in this case, α_c is about 0.16 (Amari and Maginu 1988).

4 Simulation Results of Autocorrelation Type Net
In the previous section it was shown that the convergence time of a failed recall is much larger than that of a successful recall, and that the convergence time is of order log(N), where N is the number of neurons of a layered associative net. In this section we examine by neural network simulation whether these properties still hold for an autocorrelation type of net, that is, the discrete synchronous Hopfield net with J_ii = 0 (i = 1, ..., N). Figure 3a and b shows the convergence times for α = 0.08 and 0.10, respectively. The network output is synchronously updated until it converges to a fixed pattern. This criterion of convergence differs from that of the theoretical convergence time discussed above, but there is little difference between them. In the figures, ○ indicates a successful recall (m* > 0.9) and + indicates a failed recall. We limit the iteration time to 100 and do not include cases in which the convergence time exceeds this upper limit; such cases are often observed for failed recall with m¹ < 0.4. In other words, the network always converges within 100 iterations for successful recall. These simulation results show that the convergence time distinction between successful recall and failed recall holds for this type of net as well, and the tendency becomes clearer as α increases. It is concluded that in correlation-based associative networks the convergence time of failed recall is larger than that of successful recall. Figure 4 shows the average convergence times vs. the network size (the number of neurons) for α = 0.08. In the figure, ○, +, and □ are the average convergence times over 100 successful recalls for 0.2 < m¹ < 0.3, 0.4 < m¹ < 0.5, and 0.8 < m¹ < 0.9, respectively. The convergence time is shown to be of order log(N) for large values of m¹. For small values of m¹ it is not clear that this order estimate still holds; a more precise analysis and simulations of larger networks are needed.
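The qualitative distinction is easy to reproduce with a toy synchronous Hopfield net. The sketch below is our own minimal setup, not the authors' code (N = 400 neurons, P = 20 patterns, so α = 0.05); recall starting near a stored pattern should reach a fixed point in a few steps, while recall starting far from it is expected to fail or take much longer:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 400, 20                         # neurons and stored patterns: alpha = 0.05
xi = rng.choice([-1, 1], size=(P, N))  # random uncorrelated patterns
J = (xi.T @ xi) / N                    # Hebbian (autocorrelation) matrix
np.fill_diagonal(J, 0.0)               # J_ii = 0, as in the text

def recall_time(m1, limit=100):
    """Synchronous updates from a state with direction cosine ~m1 to pattern 0.
    Returns (iterations used, final overlap with pattern 0)."""
    flip = rng.random(N) < (1.0 - m1) / 2.0   # flip a fraction (1-m1)/2 of the bits
    s = np.where(flip, -xi[0], xi[0])
    for t in range(1, limit + 1):
        s_new = np.sign(J @ s)
        s_new[s_new == 0] = 1
        if np.array_equal(s_new, s):          # converged to a fixed pattern
            return t, float(s @ xi[0]) / N
        s = s_new
    return limit, float(s @ xi[0]) / N        # no convergence within the limit

t_good, m_good = recall_time(0.9)   # initial state close to the pattern
t_bad, m_bad = recall_time(0.1)     # initial state far from the pattern
```

With these loading levels the successful recall typically terminates within a handful of synchronous sweeps, mirroring the ○/+ separation in Figure 3.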
Figure 3: (a, b) Computer simulation of convergence time using a neural network (N = 1000). ○ indicates a successful recall and + a failed recall. The horizontal axis is the initial direction cosine m¹, and the vertical axis is the convergence time (iteration time). It is clear that the convergence time of successful recall is smaller than that of failed recall.

5 Discussion
An autocorrelation type associative net has two problems from the information-processing point of view. One is that the convergence time depends strongly on the initial values: failed recalls can take more than 10 times as long as successful ones. The other is that automatically distinguishing successful recall from failed recall requires storing each memorized pattern separately in order to calculate the direction cosine between each memorized pattern and the obtained output pattern. Roughly speaking, the first problem concerns execution time and the second concerns memory. Both problems can be solved using the convergence time property that we have shown. If a network does not converge within a time limit, one can stop its execution and ignore its output, because the network has been shown to converge to a spurious state when the convergence time exceeds the limit. On the other hand, if a network converges within the time limit, one can regard the process as successful recall and its output as a meaningful pattern.

Figure 4: Average convergence time vs. network size. ○, +, and □ are the average convergence times over 100 successful recalls for 0.2 < m¹ < 0.3, 0.4 < m¹ < 0.5, and 0.8 < m¹ < 0.9, respectively. The convergence time is shown to be of order log(N) for large values of m¹.

Acknowledgments

We thank Professor S. Yoshizawa of Tokyo University for useful discussions and comments.

References

Amari, S. 1972. Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Transact. Computers C-21, 1197-1206.
Amari, S., and Maginu, K. 1988. Statistical neurodynamics of associative memory. Neural Networks 1, 63-73.
Amit, D. J. 1989. Modeling Brain Function: The World of Attractor Neural Networks. Cambridge University Press, Cambridge.
Cottrell, M. 1988. Stability and attractivity in associative memory networks. Biolog. Cybernet. 58, 129-139.
Domany, E., Kinzel, W., and Meir, R. 1989. Layered neural networks. J. Phys. A: Math. Gen. 22, 2081-2102.
Florén, P. 1991. Worst-case convergence time for Hopfield memories. IEEE Transact. Neural Networks 2(5), 533-535.
Gutfreund, H. 1988. Neural networks with hierarchically correlated patterns. Phys. Rev. A 37, 570-577.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Kohonen, T. 1972. Correlation matrix memories. IEEE Transact. Computers C-21, 353-359.
Komlós, J., and Paturi, R. 1988. Convergence results in an associative memory model. Neural Networks 1, 239-250.
Meir, R., and Domany, E. 1987. Exact solution of a layered neural network model. Phys. Rev. Lett. 59, 359-362.
Nakano, K. 1972. Associatron: A model of associative memory. IEEE Transact. Syst., Man, Cybern. SMC-2, 381-388.
Parisi, G. 1986. Asymmetric neural networks and the process of learning. J. Phys. A: Math. Gen. 19, L675-L680.

Received 28 May 1992; accepted 11 November 1992.
Communicated by Gerald Tesauro
Robustness in Multilayer Perceptrons

P. Kerlirzin, F. Vallet
Laboratoire Central de Recherches, Thomson-CSF, 92404 Orsay (cedex), France
In this paper, we study the robustness of multilayer networks versus the destruction of neurons. We show that the classical backpropagation algorithm does not lead to optimal robustness and we propose a modified algorithm that improves this capability.
1 Introduction
The distributed nature of information in neural networks suggests that they might be capable of "robust" computation, that is, a "graceful degradation" of performance with respect to damaged units and connections (Hinton et al. 1986; Le Cun 1987; DARPA 1988). In fact, such robustness is often observed empirically, but is usually uncontrolled. In this paper, we propose a method for optimizing, during learning, the ability of a network to compute after damage. Robustness to the destruction of neurons can be an interesting property for hardware implementation. One can note also that the problem of missing data corresponds to the destruction of input cells of the network. The degradation of a network due to a loss of precision of the synaptic weights ("clipping") has been studied for associative memories (Hopfield 1982; Amit et al. 1985; Cottrell et al. 1987; Wong and Sherrington 1989), where a moderate drop of the capacity was found, and for linear classifiers (Vallet and Cailton 1990). Here we study the case of the destruction of hidden and input neurons in a one-hidden-layer perceptron. The destruction of synapses can be tackled with similar solutions, and the generalization to the case of a perceptron with more than one hidden layer is easy. Previously we partially studied the case of linear networks (Kerlirzin 1990): we showed that the robustness-optimizing algorithm behaves well, confirming theoretical results, and that it optimizes robustness by distributing information over all the connections of the network. We now address the case of nonlinear networks.

Neural Computation 5, 473-482 (1993) © 1993 Massachusetts Institute of Technology
2 Background
The network considered here performs a transformation F from ℝⁿ to ℝᵐ through a hidden layer with p neurons (the output is linear for the sake of clarity):

F(X) = Σ_{i=1}^{p} σ(W_i · X) W'_i = Y

where X is the input vector, Y the output one, σ the neural transfer function (tanh(x) in our example), W_i the input weight vector linking the input vector to the ith hidden cell, and W'_i the output weight vector linking the ith hidden cell to the output vector (Fig. 1). Such a network tries to learn a desired function D (known on pattern examples: the learning set) with as few errors as possible. To achieve this, a cost function E, measuring the distance between the desired function D and the function F performed by the network (on the learning set), is minimized:

E = Σ_{μ=1}^{N} ‖D(X^μ) − F(X^μ)‖²
the learning set being composed of N examples X^μ. A method widely used to minimize this cost function is a stochastic gradient descent procedure (Le Cun 1987; Rumelhart et al. 1986). Each convergence step (elementary learning) consists in modifying the weights W_i and W'_i in the direction opposite to the gradient, for the elementary contribution E(X^μ) of the pattern X^μ to the total cost function E:

E(X^μ) = ‖D(X^μ) − F(X^μ)‖²

We are interested here in partially damaged versions of the network, which implement the functions F_K, for which only a subset K of hidden cells is active:
F_K(X) = Σ_{i∈K} σ(W_i · X) W'_i
K being a subset of {1, 2, ..., p}, which represents the hidden neurons that are not destroyed. The measure of distance between the desired function D and the damaged one F_K is then

E_K = Σ_{μ=1}^{N} ‖D(X^μ) − F_K(X^μ)‖²
The final goal is to minimize the average cost function, taking into account the probability P(K) of each configuration K occurring.
Figure 1: Representation of the contribution of the ith hidden cell.

We study here the simple and general example in which each cell has the same probability π of being damaged. The probability P(K) is thus given by

P(K) = (1 − π)^{|K|} π^{p−|K|}    (2.1)
|K| being the cardinality of K. It is now convenient to minimize the cost function

E_aver = Σ_K Σ_{μ=1}^{N} ‖D(X^μ) − F_K(X^μ)‖² P(K)    (2.2)
In order to minimize it, a stochastic gradient algorithm is used. It consists here in choosing one configuration K (randomly chosen according to the distribution given by P(K)) and one example X^μ (randomly chosen or chosen in a predefined order). A partial cost is defined for this configuration and this example:

E(K, X^μ) = ‖D(X^μ) − F_K(X^μ)‖²

Then the weights are modified in the direction opposite to the gradient of the elementary contribution E(K, X^μ). One can note that the probability P(K) is now implicitly taken into account by the distribution of the choice of the configuration K. The proposed algorithm is thus doubly stochastic, that is, on examples and on damaged configurations.
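One doubly stochastic update can be sketched in a few lines. The following is our own illustration, not the authors' implementation; the layer sizes match the 23 × 16 × 9 experiment described below, while the learning rate and weight scales are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, p, n_out, pi = 23, 16, 9, 0.1      # dimensions from the experiment; pi = damage rate
W = rng.normal(0.0, 0.1, (p, n_in))      # input weight vectors W_i
V = rng.normal(0.0, 0.1, (n_out, p))     # output weight vectors W'_i
eta = 0.01                                # learning rate (arbitrary)

def damaged_step(x, d):
    """One doubly stochastic step: draw a damage mask K ~ Bernoulli(1 - pi)
    per hidden cell, then descend the gradient of ||d - F_K(x)||^2
    (the constant factor 2 is absorbed into eta)."""
    global W, V
    k = (rng.random(p) >= pi).astype(float)     # 1 = cell alive, 0 = destroyed
    pre = W @ x
    h = np.tanh(pre) * k                        # damaged hidden activity
    err = V @ h - d                             # F_K(x) - D(x)
    grad_V = np.outer(err, h)
    grad_h = (V.T @ err) * k * (1.0 - np.tanh(pre) ** 2)
    grad_W = np.outer(grad_h, x)
    V -= eta * grad_V
    W -= eta * grad_W
    return float(err @ err)                     # the partial cost E(K, x)

cost = damaged_step(np.ones(n_in) * 0.2, np.zeros(n_out))
```

Repeating `damaged_step` over examples and freshly drawn masks samples P(K) implicitly, exactly as the text describes.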
3 Simulations and Results
The efficiency of the proposed learning algorithm is studied on a real-world problem: the identification of radio transmitters characterized by relevant features. This is a (supervised) classification problem with 9 classes and vectors (patterns) of dimension 23. The learning set contains 810 examples and the test set (for evaluating the generalization rates) contains 400 examples. The network used is totally connected with 3 layers (23 × 16 × 9). This is the one-hidden-layer network that provides the best generalization rate (86.9%), whereas a network without hidden layer provides a generalization rate of 79%. We present here the generalization results (the only interesting ones in practice); the learning results behave similarly with higher values. We compare the results between the classical network trained without destruction of cells (called "classical") and the "optimized" network trained by stochastic gradient descent on the cost function (2.2). The probability of destruction of neurons during the learning phase is 10% (so P(K) = 0.9^{|K|} 0.1^{16−|K|} = 9^{|K|}/10^{16}). The robustness of each network when d hidden cells (d = 0, 1, ...) are destroyed is then measured. For each case we present two generalization rates: the average rate, calculated for a given d by taking into account all possible combinations of d destroyed hidden cells, and the worst rate, which is the smallest rate found in testing all the configurations with d destroyed hidden cells. The proposed algorithm was first studied with randomly initialized weights. In that case the proposed algorithm is not efficient, and the classical one seems to be robust. Another strategy was therefore studied to obtain an optimized network that is efficient even without destroyed cells. This strategy consists in learning in two steps.
The network is first trained with the classical backpropagation algorithm; second, the robustness-optimizing algorithm is used to continue its training. It has been observed that the supplementary learning time is of the same order as the fore-learning time. The results obtained are summarized in Table 1. The conclusions are now clear. First, the classical backpropagation algorithm is not optimal with respect to the destruction of neurons. The proposed algorithm hardly degrades the results relative to the classical one when no cells are destroyed. When one considers the destruction of neurons, both the worst and average cases are better for the optimized network. Further, for a given number of killed cells, the optimized network exhibits a smaller spread in performance (the standard deviation of the performance over all possible configurations). This general solution can also be used in the case of a network with several hidden layers: learning is still made with random destruction
Table 1: Comparison of Classical and Optimized Networks.

                            Worst case (%)            Average (%)
                          Classical  Optimized    Classical  Optimized
  0 cell killed (0%)         86.9       86.4         86.9       86.4
  1 cell killed (6%)         74.1       82.9         83.8       85.6
  2 cells killed (12%)       60.8       74.6         79.3       84.6
  3 cells killed (19%)       45.5       74.3         72.8       83.2
  4 cells killed (25%)       30.1       68.0         66.0       81.2
  8 cells killed (50%)        3.2       36.9         43.6       64.5
of neurons according to destruction laws given for each layer. This approach should also be valid for the destruction of synaptic weights. It is interesting to study the effect of the mortality rate (the probability π of hidden neuron destruction during learning, in equation 2.1) on the robustness of the network. The results are shown in Figure 2: the average and worst cases and the standard deviation of the generalization rates are plotted versus the mortality rate π introduced during learning. These curves are shown for 6 different values of the number d of destroyed cells in the test phase: d = 0, 1, 2, 3, 4, and 8. It is interesting to notice that the recognition rates (worst and average cases) are roughly increasing functions of π and that their asymptotic values are reached at a value corresponding to d. Thus, if one wants to optimize a network for a destruction rate π', it is necessary to use a learning destruction rate π slightly greater than π'. The value of π need not be very precise, since the generalization rate varies slowly with π. Another remarkable point is the decrease of the standard deviation of the results as π increases: the result with a given number d of destroyed hidden cells hardly depends on the choice of these cells. Finally, the proposed algorithm is able to solve the robustness problem under discussion. For d = 8 destroyed neurons during test, for example, the usual algorithm has an average test rate of 45% versus 75% for the proposed optimized version (π = 50%). The method proposed above can easily be applied to the case of the destruction of input cells. From an operational point of view, this is the problem of missing input data: some data may arrive in an intermittent way or even completely disappear because of the failure of an upstream processing step or of an out-of-order sensor. In this case, the aim is to obtain the most robust network with respect to this type of destruction.
We have applied the learning algorithm described above, in the case of input cells, to the problem previously described, and we show the results in the same way. Figures 3 and 4 represent the generalization performances for the average and worst cases and the standard deviation for five values of the
Figure 2: Average and worst case and standard deviation of the generalization rate versus π, the probability of destruction of a hidden cell during learning. Six cases are represented, corresponding to different numbers d of destroyed neurons during test: d = 0, 1, 2, 3, 4, and 8.
Figure 3: Missing input data. Average case, worst case, and standard deviation of the generalization rate versus π, the probability of destruction of an input cell during learning. Five cases are shown, corresponding to different numbers d of destroyed neurons during test: d = 0, 1, 2, 3, and 4. Curves correspond to the randomly initialized network.
Figure 4: Same as Figure 3 with the fore-learning initialized network.

number d of destroyed input cells during test. The figures on the left side correspond to the case of a randomly initialized network and those on the right side correspond to the case of an initialization by fore-learning. In contrast to the previous case, the fore-learning case does not seem to give notably better results than the randomly initialized case. For each d in {0, 1, 2, 3, 4, 8}, the value of the generalization rate for π = 10% of the
randomly initialized case is higher than the corresponding value of the fore-learning case.

4 Conclusions
In this article, we have examined the problem of improving the robustness of a multilayer perceptron with respect to the destruction of hidden or input cells. We have shown that classical backpropagation is not optimal when one considers robustness. We have proposed a learning algorithm that takes into account the potential destruction of cells (destruction during learning) and that was evaluated on a real example. The case of destroyed input cells seems different from that of hidden cells: in the first case, a random initialization and a weak destruction of neurons during learning seem to give better results; in the second case, a fore-learning initialization and a strongly damaged learning seem to be better. Our first results are very encouraging, but other experiments on real data have to be done, especially when the network size is large.
Acknowledgment

The authors want to thank P. Gallinari for his support of this work.
References

Amit, D. J., Gutfreund, H., and Sompolinsky, H. 1985. Storing infinite numbers of patterns in a spin-glass model of neural networks. Phys. Rev. Lett. 55(14), 1530-1533.
Cottrell, D. M., Munro, P., and Zipser, D. 1987. Image compression by back-propagation: An example of extensional programming. In Advances in Cognitive Science, Vol. 3, N. E. Sharkey, ed. Ablex, Norwood, NJ.
DARPA. 1988. Neural Network Study. AFCEA International Press, Fairfax.
Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. 1986. Distributed representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. I. Bradford Books, Cambridge, MA.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Kerlirzin, P. 1990. Robustesse et capacité des réseaux multicouches. Rapport de stage (DEA Paris XI Orsay), LCR ASRF/90-8. Responsable: F. Vallet.
Le Cun, Y. 1987. Modèles Connexionnistes de l'Apprentissage. Ph.D. thesis, Université Pierre et Marie Curie, Paris, France.
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. II. Bradford Books, Cambridge, MA.
Vallet, F., and Cailton, J.-G. 1990. Recognition rates of the Hebb rule for learning boolean functions. Phys. Rev. A 41(6), 3059-3065.
Wong, K. Y. M., and Sherrington, D. 1989. Theory of associative memory in randomly connected boolean neural networks. J. Phys. A 22, 2233-2263.
Received 28 October 1991; accepted 6 August 1992.
Communicated by Richard Lippmann
Pattern Discrimination Using Feedforward Networks: A Benchmark Study of Scaling Behavior

Thorsteinn Rögnvaldsson
Department of Theoretical Physics, University of Lund, Sölvegatan 14 A, S-223 62 Lund, Sweden
The discrimination powers of multilayer perceptron (MLP) and learning vector quantization (LVQ) networks are compared for overlapping gaussian distributions. It is shown, both analytically and with Monte Carlo studies, that the MLP network handles high-dimensional problems in a more efficient way than LVQ. This is mainly due to the sigmoidal form of the MLP transfer function, but also to the fact that the MLP uses hyperplanes more efficiently. Both algorithms are equally robust to limited training sets, and the learning curves fall off like 1/M, where M is the training set size, which is compared to theoretical predictions from statistical estimates and Vapnik-Chervonenkis bounds.

1 Introduction
The task of discriminating between different classes of input patterns has proven to be well suited for artificial neural networks (ANN). Standard methods, like making cuts or discriminant analysis, are repeatedly being outperformed by nonlinear ANN algorithms, where the most extensively used algorithms are the feedforward multilayer perceptron (MLP) (Rumelhart and McClelland 1986) and the learning vector quantization (LVQ) (Kohonen 1990). Both algorithms have shown good discrimination and generalization ability, although some confusion prevails on their performance on realistic large-sized problems, especially concerning their parsimony in parameters to fit data, an important issue when algorithms are transferred to hardware. This paper compares, analytically and with Monte Carlo simulations, the discrimination power of the MLP and LVQ algorithms on separating two gaussian distributions. Two classes are sufficient since the results carry over to problems with more classes. The problem is designed to resemble "real-life" situations with many input nodes and overlapping distributions, making the classification fuzzy. Discrimination is thus only possible down to a minimum error, the Bayes limit (Duda and Hart 1973). It is found, in contrast to previous results (Kohonen et al. 1988; Barna and Kaski 1990), that the MLP is more efficient than the LVQ algorithm on heavily overlapping distributions. The sensitivities of the two algorithms

Neural Computation 5, 483-491 (1993) © 1993 Massachusetts Institute of Technology
Table 1: Bayes Limit for the "Hard" and "Easy" Cases for Dimensions 2 ≤ d ≤ 8.

           Bayes limit (%)
  d    "Hard" case   "Easy" case
  2       26.4          16.4
  3       21.4          13.8
  4       17.6          11.6
  5       14.8           9.8
  6       12.4           8.4
  7       10.6           7.2
  8        9.0           6.2
to limited training data are also examined and compared to theoretical predictions.

2 The Problem
The problem (Kohonen et al. 1988) consists of two overlapping gaussian distributions, P₁ and P₂, of dimensionality d, normalized to unity, with standard deviations σ₁ = 1.0 and σ₂ = 2.0. Two versions are generated: one where the distributions have the same mean, referred to as the "hard" case, and one where their means are separated by a vector ξ, referred to as the "easy" case (notation follows Kohonen et al. 1988):
P₁(r) = (σ₁√(2π))^{−d} exp(−|r|²/(2σ₁²))    (2.1)

P₂(r) = (σ₂√(2π))^{−d} exp(−|r − ξ|²/(2σ₂²))    (2.2)
where ξ = 0 for the "hard" case, and ξ = (2.32, 0, 0, ..., 0) for the "easy" case. The Bayes limit equals ∫ min[P₁, P₂] dr, which is easily calculated for gaussian distributions (see Table 1).

3 Analytical Results
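Before turning to the geometry, the "hard"-case entries of Table 1 can be checked directly: min[P₁, P₂] equals P₂ inside the decision radius R and P₁ outside, so the integral reduces to two chi-square tail probabilities. The sketch below is our own check (not from the paper); it folds in equal class priors, i.e. a factor 1/2, and handles even d only, where the chi-square CDF is elementary:

```python
import math

def chi2_cdf_even(x, d):
    """CDF of a chi-square variable with an even number d of degrees of
    freedom (closed form, no scipy needed)."""
    k = d // 2
    s = sum((x / 2.0) ** j / math.factorial(j) for j in range(k))
    return 1.0 - math.exp(-x / 2.0) * s

def bayes_limit(d, s1=1.0, s2=2.0):
    """Equal-prior Bayes error for two zero-mean isotropic gaussians
    ("hard" case): half the mass of P2 inside the decision sphere plus
    half the mass of P1 outside it."""
    # Decision radius from P1(R) = P2(R)
    R2 = 2.0 * d * math.log(s2 / s1) * s1**2 * s2**2 / (s2**2 - s1**2)
    inside = chi2_cdf_even(R2 / s2**2, d)         # mass of P2 inside radius R
    outside = 1.0 - chi2_cdf_even(R2 / s1**2, d)  # mass of P1 outside radius R
    return 0.5 * (inside + outside)

hard_case = [round(100 * bayes_limit(d), 1) for d in (2, 4, 6, 8)]
```

Here `hard_case` comes out as [26.4, 17.6, 12.4, 9.0], matching the "hard"-case column of Table 1 for the even dimensions.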
The optimal classification boundary between the two gaussian distributions is, in both the "hard" and "easy" cases, the surface of a d-dimensional hypersphere, defined by P₁(r) = P₂(r). This is the surface the networks try to reproduce. The MLP uses "hyperplanes" (see Fig. 1a) and is able to "cut corners" if a sigmoidal transfer function is used. The LVQ network divides the space by means of reference vectors, a so-called "tessellation,"
Figure 1: (a) The MLP approximation to a sphere and (b) the LVQ tessellation for the same sphere (dots correspond to reference vectors). (c) The resulting polyhedron and quantities used in the text.

and if the number of reference vectors is moderate, the LVQ algorithm will result in one reference vector being inside the sphere and the others outside (Fig. 1b). Both algorithms thus produce a polyhedral-like reproduction of the sphere (Fig. 1c), and the generalization error E can be expressed as
E ≈ B + ΔV ΔP    (3.1)
where B is the Bayes limit, ΔV is the deficit volume of the polyhedron as compared to the hypersphere, and ΔP is the average absolute difference between the two distributions inside that volume. This error can be estimated for the "hard" case, where the polyhedron is assumed to be spherically symmetric. If the number of planes bounding the polyhedron is N, it will consist of N conical sectors with opening angle α (see Fig. 1c). If N ≫ d the end-cap of the cone is a (d − 1)-dimensional hexagon, approximately a (d − 1)-dimensional hypersphere, and α will be given by

α ≈ [(d − 1) A_d / (N A_{d−1})]^{1/(d−1)}    (3.2)

where A_d = 2π^{d/2}/Γ(d/2) and Γ is the gamma function.¹ It is assumed in the last step that N is so large that α is small.

¹A_d comes from the surface area A = A_d r^{d−1} of a hypersphere of dimension d and radius r.
The volumes of the cone and the sector are given by equations 3.3 and 3.4, where

R = √(2d ln[σ₂/σ₁] σ₁²σ₂² / (σ₂² − σ₁²))

is the radius of the optimal classification hypersphere. The deficit volume is approximated by expanding in a power series,
where equation 3.2 has been used in the last step. If ΔV is not too large, the density of patterns inside it can be assumed constant. For σ₁ < σ₂ and ΔR ≪ R one gets
where P_R is the value of the distributions at the border. Inserting equations 3.5 and 3.6 into 3.1 gives
(E − B)/B ≈ C(d) d N^{−4/(d−1)} ∝ d N^{−4/(d−1)}    (3.7)

where the prefactor C(d), given by equation 3.8,
is approximately constant for 4 < d < 10 (see Fig. 2). The true value of C(d) is not exactly given by equation 3.8; equation 3.7 should instead be considered as a general scaling relation for heavily overlapping distributions of dimension d < 10, under the condition that N ≫ d and α can be considered small. For comparison with Monte Carlo simulations a value of C(d) = 0.5 was used. Deriving a similar expression for the "easy" case would be arduous; the polyhedron is not spherically symmetric and the simplifications above are not possible. The generalization error is, however, bounded from above by expression 3.7: if ξ → ∞ the distributions are perfectly separated by one hyperplane, with E equal to zero, and if ξ → 0 the "hard" case is recovered. Hence, the "easy" problem never scales worse than the corresponding "hard" problem.
Figure 2: The factor C(d) for 2 < d < 10 when σ₁ = 1.0 and σ₂ = 2.0.

Each reference vector in an LVQ network with N_LVQ computational units has at most N_LVQ − 1 nearest neighbors. For moderate N_LVQ, with a setup similar to Figure 1b, the number of hyperplanes bounding the polyhedron is N = N_LVQ − 1. On the other hand, more than one reference vector will be inside the sphere if N_LVQ is large, and the number of hyperplanes bounding the polyhedron will then be N < N_LVQ − 1. The LVQ generalization error will thus scale like equation 3.7 for moderate N_LVQ and worse for large N_LVQ. Each hidden unit in an MLP with a Heaviside transfer function corresponds to one hyperplane; hence, an MLP with N_MLP hidden units has N = N_MLP. A smooth sigmoidal transfer function allows the MLP to "cut corners" in the polyhedron, and the generalization error subsequently scales better than 3.7. How much better is difficult to predict, but analytical results (Sontag 1992) imply that a sigmoid allows a decrease in the number of hidden units by at least a factor of two compared to a Heaviside.

4 Monte Carlo Studies
The dimension d of the gaussian distributions varied between 2 and 8, and 12 different architectures of MLP and LVQ networks were set up for each value of d. An ensemble of 100 networks was trained for each configuration to measure the worst, best, and average discrimination performances. The LVQ networks had d input units and N_LVQ processing units, where N_LVQ ∈ {d, d+4, d+8, ..., d+4·11}. The weights were initialized with the on-line k-means clustering algorithm (MacQueen 1967) for a duration of 150 epochs, with randomly selected starting points. LVQ updating was
Thorsteinn Rognvaldsson
then applied for an additional 150 epochs, while the learning rate was lowered geometrically from η = 0.1 down to η = 0.001. One epoch corresponded to 1000 patterns. This choice of parameters and initialization was made to match those of Kohonen et al. (1988). The MLP networks had d input units and one hidden layer with N_MLP hidden units, where N_MLP ∈ {d, d+2, d+4, ..., d+2·11}. The initial weights were randomly picked from a flat distribution w ∈ [−w₀, w₀], with w₀ = 0.1/(maximum "fan-in").² On-line backpropagation with summed square error was used to train the networks (in the "hard" case a Langevin form, Δw = −η∇E + "noise," was used to avoid local minima). The learning rates were scaled in inverse proportion to the "fan-in" of the layer, η ∝ 1/"fan-in," and dynamically changed with the "bold driver" method: increasing η if the network is improving and decreasing it otherwise. The momentum parameter was kept constant at α = 0.5. The largest architectures were also trained with 7 different training sets, with sizes M ∈ {10, 10^(3/2), 10^2, 10^(5/2), 10^3, 10^(7/2), 10^4}, to estimate the learning curve. For each value of M the number of epochs was changed to keep the total number of presentations constant at 4 × 10^5 for the MLP and 3 × 10^5 for the LVQ networks. All simulations were performed with the network program package JETNET 2.0 (Lonnblad et al. 1992). Figure 3 shows the E(N)-behavior for d = 8, training on continuously generated data (figures for d < 8 are similar). Indicated for the "hard" case is the predicted scaling relation of equation 3.7 with C = 0.5. The LVQ errors follow the predicted behavior well, whereas the MLP errors scale better, as expected. For the "easy" case the LVQ error scales like (E − B)/B ∝ dN^(−3/2). The worst case upper bound on the generalization error for this kind of binary classification problem is expected to be E ≤ d_VC/M, where d_VC is the Vapnik-Chervonenkis dimension of the network (Cohn and Tesauro 1992).
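The "bold driver" schedule described above can be sketched as follows. The growth and decay factors (1.1 and 0.5) are assumed values for illustration; the text does not specify which constants were used.

```python
# Minimal "bold driver" learning-rate schedule: grow the rate while the
# training error improves, shrink it when the error rises.
# The factors 1.1 and 0.5 are assumed values, not those of the paper.
def bold_driver(error_history, lr, grow=1.1, shrink=0.5):
    """Return the learning rate for the next epoch."""
    if len(error_history) < 2:
        return lr
    if error_history[-1] < error_history[-2]:   # network is improving
        return lr * grow
    return lr * shrink                          # error got worse

lr = 0.1
errors = []
for epoch_error in [1.0, 0.8, 0.7, 0.9, 0.6]:   # made-up epoch errors
    errors.append(epoch_error)
    lr = bold_driver(errors, lr)
```

In practice the scheme is usually combined with rejecting the weight step that caused the error increase; the sketch shows only the rate adaptation itself.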
Figure 4 shows that this bound is indeed well above the actual learning curves if the number of weights is used as an approximate value of d_VC. The learning curves are also well described by E ∝ 1/(M + M₀), as predicted by statistical learning theories for problems with a continuous generalization spectrum (Schwartz et al. 1990), and the algorithms are equally robust to limited training sets. An attempt was also made to estimate the prior generalization spectrum. The generalization ability of 10,000 untrained networks was tested and the learning curves numerically calculated from
E(M) = 1 − ⟨g^(M+1)⟩₀ / ⟨g^M⟩₀

where ⟨g^M⟩₀ is the Mth moment of the prior generalization distribution (Richard and Lippmann 1991). The resulting curves (dotted lines in

²The "fan-in" of a unit is the number of units feeding to it.
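The moment-based estimate above can be computed directly from sampled generalization values of untrained networks. The sketch below assumes a uniform prior on g purely for illustration; the paper's networks of course induce a different prior spectrum.

```python
import numpy as np

# Estimate the learning curve from the prior generalization spectrum:
# E(M) = 1 - <g^(M+1)>_0 / <g^M>_0, with moments taken over the
# generalization abilities g of many untrained (randomly drawn) networks.
# The uniform prior on g below is an assumption for illustration only.
rng = np.random.default_rng(0)
g = rng.uniform(0.5, 1.0, size=10_000)   # sampled prior generalization values

def learning_curve(g, M_values):
    return [1.0 - np.mean(g ** (M + 1)) / np.mean(g ** M) for M in M_values]

curve = learning_curve(g, [0, 10, 100, 1000])
```

Because the moment ratio ⟨g^(M+1)⟩/⟨g^M⟩ grows with M for any spectrum on (0, 1], the predicted error decreases monotonically with training set size.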
Figure 3: Minimum, maximum, and average values of the scaled generalization error for d = 8 using (a) MLP and (b) LVQ on the "hard" task, and (c) MLP and (d) LVQ on the "easy" task. N is the number of hyperplanes bounding the polyhedron. For the MLP, N equals the number of hidden units, whereas for the LVQ, N equals the number of reference vectors minus one. The dashed line in (a) and (b) is the predicted (E − B)/B ∝ dN^(−4/(d−1)), whereas in (c) and (d) it is a fitted (E − B)/B ∝ dN^(−3/2).

Fig. 4) do not follow the true learning curves. Statistical errors, however, are large already for log M ≈ 1.5 and higher statistics were not pursued due to the CPU consumption involved.
5 Conclusions and Discussion

The results demonstrate that the MLP architecture, with sigmoidal transfer functions, is superior to LVQ for discriminating between heavily overlapping distributions with convex borders. This is in contrast to previous results comparing LVQ and MLP architectures (Kohonen et al. 1988; Barna and Kaski 1990), but in agreement with results for Boltzmann (Kohonen et al. 1988) and mean field machines (Peterson and Hartman 1989). The MLP method is highly efficient and the more viable alternative for problems with many inputs. Furthermore, both algorithms are equally robust to limited training data and their learning curves follow a 1/M behavior.
Figure 4: The "hard" task learning curves for (a) MLP and (b) LVQ, and the "easy" task learning curves for (c) MLP and (d) LVQ. M is the number of training patterns used. The solid line is a fitted (E − B)/B ∝ 1/(M + constant) and the dashed line shows W/M, where W is the number of weights. The dotted lines are the curves numerically calculated from the prior generalization spectrum.

These results are also valid for binary problems where more than two classes of input patterns are present. An MLP with n output units can be used for an n-class problem. The output values will then correspond to the Bayesian a posteriori likelihoods that the input pattern belongs to the specific class (Richard and Lippmann 1991). A discrimination decision can therefore be made from these output values, and the number of hidden units will in the worst case just be multiplied by n. For moderate N_LVQ, the number of units in the LVQ network will also scale multiplied by n, and the relative performances of MLP and LVQ will hence be approximately unchanged. LVQ may of course be more efficient than MLP for extreme problems with very large n >> d where each class only needs one or two reference vectors. The reason the MLP performs well already for low N_MLP is clearly its sigmoidal transfer function, making it possible to smooth the corners of the polyhedron. Allowing direct input-to-output connections in the MLP would increase the discrimination power further (Sontag 1992; Peterson and Hartman 1989). The LVQ algorithm can be augmented with more advanced evaluations of the output signals (Barna and Kaski 1990; Csabai et al. 1992) and thus be made to perform better. With such superstructures one can achieve discrimination performances comparable to the MLP, but at the price of many more parameters.
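The n-class decision rule described above — reading the network's n outputs as estimates of the posterior class probabilities and choosing the largest — is simply:

```python
import numpy as np

# Treat the n output activations as estimates of the Bayesian a posteriori
# probabilities P(class | input) and decide by taking the largest.
# The output values below are invented for illustration.
outputs = np.array([0.08, 0.71, 0.21])   # one value per class
predicted_class = int(np.argmax(outputs))
```

With well-calibrated outputs this argmax rule approximates the Bayes-optimal decision for 0/1 loss.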
Acknowledgments
This work has benefited from discussions with Bo Söderberg.

References

Barna, G., and Kaski, K. 1990. Stochastic vs. deterministic neural networks for pattern recognition. Phys. Scripta T33, 110-115.
Cohn, D., and Tesauro, G. 1992. How tight are the Vapnik-Chervonenkis bounds? Neural Comp. 4, 249-269.
Csabai, I., Czakó, F., and Fodor, Z. 1992. Combined neural network-QCD classifier for quark and gluon jet separation. Nuclear Phys. B374, 288-308.
Duda, R., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
Kohonen, T. 1990. Self-Organization and Associative Memory, 3rd ed. Springer-Verlag, Heidelberg.
Kohonen, T., Barna, G., and Chrisley, R. 1988. Statistical pattern recognition with neural networks: Benchmarking studies. IEEE Second Int. Conf. Neural Networks I, 61-68. SOS Printing, San Diego, CA.
Lonnblad, L., Peterson, C., and Rognvaldsson, T. 1992. Pattern recognition in high energy physics with artificial neural networks-JETNET 2.0. Computer Phys. Commun. 70, 167-182.
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symposium on Math. Stat. and Prob., L. M. LeCam and J. Neyman, eds. University of California Press, Berkeley.
Peterson, C., and Hartman, E. 1989. Explorations of the mean field theory learning algorithm. Neural Networks 2, 475-494.
Richard, M., and Lippmann, R. 1991. Neural network classifiers estimate Bayesian a posteriori probabilities. Neural Comp. 3, 461-483.
Rumelhart, D., and McClelland, J. (eds.) 1986. Parallel Distributed Processing, Vol. 1. MIT Press, Cambridge, MA.
Schwartz, D., Samalam, V., Solla, S., and Denker, J. 1990. Exhaustive learning. Neural Comp. 2, 374-385.
Sontag, E. 1992. Feedforward nets for interpolation and classification. J. Computer Syst. Sci., in press.

Received 6 July 1992; accepted 20 October 1992.
Communicated by Richard Lippmann
A Neural Network That Learns to Interpret Myocardial Planar Thallium Scintigrams

Charles Rosenberg
Geriatrics, Research, Education and Clinical Center, VA Medical Center, Salt Lake City, UT 84148 USA
Jacob Erel
Department of Cardiology, Sapir Medical Center-Meir General Hospital, Kfar Saba, Israel
Henri Atlan
Department of Biophysics and Nuclear Medicine, Hadassah Medical Center, Jerusalem, Israel
The planar thallium-201 (201Tl) myocardial perfusion scintigram is a widely used diagnostic technique for detecting and estimating the risk of coronary artery disease. Interpretation is currently based on visual scoring of myocardial defects combined with image quantitation and is known to have a significant subjective component. Neural networks learned to interpret thallium scintigrams as determined by both individual and multiple (consensus) expert ratings. Four different types of networks were explored: single-layer, two-layer backpropagation (BP), BP with weight smoothing, and two-layer radial basis function (RBF). The RBF network was found to yield the best performance (94.8% generalization by region) and compares favorably with human experts. We conclude that this network is a valuable clinical tool that can be used as a reference "diagnostic support system" to help reduce inter- and intraobserver variability. This system is now being further developed to include other variables that are expected to improve the final clinical diagnosis.

1 Introduction

Coronary artery disease (CAD) is one of the leading causes of death in the Western world. The planar thallium-201 scintigram is considered to be a reliable diagnostic tool in the detection of CAD. Thallium is a radioactive isotope that distributes in mammalian tissues after intravenous administration and is imaged by a gamma camera. The resulting scintigram is visually interpreted by the physician for the presence or absence of

Neural Computation 5, 492-502 (1993) © 1993 Massachusetts Institute of Technology
defects: areas with relatively lower perfusion levels. In myocardial applications, thallium is used to measure myocardial ischemia and to differentiate between viable and nonviable (infarcted) heart muscle (Pohost and Henzlova 1990). Diagnosis of CAD is based on the comparison of two sets of images, one set acquired immediately after a standard effort test (BRUCE protocol), and the second following a delay period of four hours. During this delay, the thallium redistributes in the heart muscle and spontaneously decays. Defects caused by scar tissue are relatively unchanged over the delay period (fixed defect), while those caused by ischemia are partially or completely filled in (reversible defect) (Beller 1991). Image interpretation is difficult for a number of reasons: the inherent variability in biological systems that makes each case essentially unique, the vast amount of irrelevant and noisy information in an image, and the "context dependency" of the interpretation on data from many other tests and clinical history. Interpretation can also be significantly affected by attentional shifts, perceptual abilities, and mental state (Cuarón et al. 1980; Franken and Berbaum 1991). Neural networks have been applied to several problems in cardiology including the detection of stenosis (Porenta et al. 1990; Lee 1990; Cios et al. 1989; Cianflone et al. 1990; Porenta et al. 1988). These studies encouraged us to explore the problem of visual interpretation using neural networks, which, to our knowledge, has not been previously addressed.¹

2 Data
Scintigraphic images were acquired for each of three views: anterior (ANT), left anterior oblique (LAO 45), and left lateral (LAT)² for each patient case. Each image was first preprocessed³ and presented to the networks as circumferential profiles (Francisco et al. 1982; Garcia et al. 1981),⁴ in which maximum pixel counts within each of 60 6° contiguous segmental regions are plotted as a function of angle (Garcia 1991) (see Fig. 1A). Cases were preselected based on the following criteria (Beller 1991):
- Insufficient exercise. Cases in which the heart rate was less than 130 bpm were eliminated, as this level of stress is generally deemed

¹Since acceptance of this paper for publication, we learned that another project has developed along similar lines (Dorffner et al. 1992).
²Also sometimes referred to as LAO 70 or the "steep" LAO view.
³Preprocessing involved positioning of the region of interest (ROI), interpolative background subtraction, smoothing, and rotational alignment to the heart's apex (Garcia 1991).
⁴The profiles were generated using the Elscint CTL software package for planar quantitative thallium-201 based on the Cedars-Sinai technique (Garcia et al. 1981; Maddahi et al. 1981; Areeda et al. 1982).
Figure 1: (A) Circumferential profiles for stress (top) and delayed (bottom) distributions for the left lateral view (LAT). The relatively lower levels of perfusion in the center of both graphs are evidence of a mild-to-moderate defect in the apical region. The defect remains after the delay period (bottom curve), suggesting that the defect is "fixed" and likely due to the presence of scar tissue. (B) The general two-layer network architecture. See text for explanation.
insufficient to accurately distinguish normal from abnormal conditions.
- Positional abnormalities. In a few cases, the "region of interest" was not positioned or aligned correctly by the technician.

- Increased lung uptake. Typically in cases of multivessel disease, a significant proportion of the perfusion occurs in the lungs as well as in the heart, making it more difficult to determine the condition of the heart due to the partially overlapping positions of the heart and lungs.

Twenty percent of the cases were eliminated due to insufficient heart rate, and an additional 5% due to either positional abnormalities or increased lung uptake. A set of 100 usable cases resulted. Each case was visually scored for each of nine anatomical regions generally accepted as those that best relate to the coronary circulation: septal, proximal and distal; anterior, proximal and distal; apex; inferior, proximal and distal; posterior-lateral, proximal and distal. Scoring for each region was from normal (1) to severe (4), indicating the level of the observed perfusion deficit. Approximately 80% of the cases were visually interpreted by a single expert alone. Ten percent were read by all three experts, and 10% by two experts, with agreement in the scoring reached by consensus.
2.1 Methods. Four different types of networks were explored: a one-layer (sigmoidal) network trained using the delta rule, a two-layer network trained using backpropagation (BP), a two-layer network trained using BP combined with weight smoothing (to be described), and a two-layer network combining one layer of competitive radial-basis function (RBF) units and a second layer of (sigmoidal) units trained using the delta rule. The general two-layer network architecture is depicted in Figure 1B. All networks consisted of 180 input units and nine output units. The input units were divided into three groups. Each group of 60 units encoded the circumferential profile for one view. A relative coding scheme was used in which the 60 ordered segments constituting each view were divided by the segment with the highest scintigraphic count. Each input pattern was then normalized to unit vector length. Target numerical values were assigned to the categorical visual scores to make the data suitable for network learning: normal = 0.0, mild defect = 0.3, moderate defect = 0.7, and severe defect = 1.0.

2.1.1 Weight Smoothing. Weight smoothing "penalized" adjacent fan-in weights to the hidden units in proportion to their difference:
C_ij = (w_ij − w_i,j−1) + (w_ij − w_i,j+1)    (2.1)

where j − 1 and j + 1 are the left and the right "neighbors" of j. This rule
is a special case of one proposed elsewhere (Lang and Hinton 1990). The value of Cv was subtracted from Awij on each weight update:
w.. 11 - w..9 + Aw.. 11 - PC.. ‘J In the present experiments, the smoothing parameter
(2.2)
P was set to 0.01.
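Equations 2.1 and 2.2 amount to a discrete smoothness penalty on each hidden unit's fan-in weights. A numpy sketch, restricted to interior weights since the boundary treatment at the two ends is not specified in the text:

```python
import numpy as np

# Smoothness penalty of equation 2.1 for one hidden unit's fan-in weights:
# C_j = (w_j - w_{j-1}) + (w_j - w_{j+1}) for interior j.  How the paper
# handles the endpoints is not stated, so they are left unpenalized here.
def smoothing_penalty(w):
    C = np.zeros_like(w)
    C[1:-1] = (w[1:-1] - w[:-2]) + (w[1:-1] - w[2:])
    return C

def smoothed_update(w, dw, beta=0.01):
    # Equation 2.2: w <- w + dw - beta * C
    return w + dw - beta * smoothing_penalty(w)

w = np.array([0.0, 1.0, 0.0, 1.0, 0.0])     # a deliberately jagged profile
w_new = smoothed_update(w, np.zeros_like(w))
```

With a zero gradient step, the penalty alone pulls each interior weight toward the average of its neighbors, smoothing the jagged profile.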
2.1.2 Radial Basis Function Networks. RBF networks (Moody and Darken 1989; Poggio and Girosi 1990; Broomhead and Lowe 1988) typically consist of two layers of units: one layer of radial units, typically gaussian, and a second layer of semilinear units. In our experiments, the activation value of a gaussian unit, O_j, is given by

O_j = exp( −Σ_i (v_i/||v|| − w_ij)² / (2σ²) )    (2.4)
where j is an index to a gaussian unit, i is an input unit index, and ||v|| is the length of the current input vector v (for a single view). The width of the gaussian, given by σ, was fixed at 0.25 for all units. The gaussian units were trained using a competitive learning scheme that moves the center of the unit closest to the current input pattern (O_max, i.e., the "winner") closer to the input pattern:

Δw_max,i = η (v_i/||v|| − w_max,i)    (2.5)

In addition, the other units were also pulled toward the input pattern, although to a much smaller extent.⁵ We used a ratio of 1 in 100. The second layer was trained using the delta rule (Widrow and Hoff 1960). Input to the second layer consisted of the activations of the entire set of nine gaussian units plus an additional unit per view to encode the scaling factor of the input patterns lost as a result of input vector normalization. This additional unit was clamped to the minimum value of the input pattern for that view.⁶ In our experiments, first the competitive layer was trained and then the supervised.

2.1.3 Training. Because of the limited number of cases available, we employed the "leave-one-out" or "jackknife" method of training. We trained 100 separate networks; each network was trained on a subset of

⁵Referred to as the leaky learning model in Rumelhart and Zipser (1986).
⁶The scaling factor was given as additional input to the output units in the other networks as well to ensure that the comparisons were fair.
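A numpy sketch of the gaussian activation and the leaky competitive update just described. The center initialization and learning rate are illustrative assumptions; the 1-in-100 leak ratio is taken from the text.

```python
import numpy as np

# Gaussian unit activations for a length-normalized input (cf. eq. 2.4) and
# a "leaky" competitive update of the unit centers (cf. eq. 2.5): the
# winner moves at the full rate, the losers at 1/100 of it.
# Centers and rates here are illustrative values, not those of the paper.
rng = np.random.default_rng(1)
centers = rng.uniform(0, 1, size=(9, 60))   # 9 gaussian units, 60 inputs
sigma = 0.25                                # fixed gaussian width

def activations(v, centers, sigma):
    u = v / np.linalg.norm(v)               # normalize to unit length
    d2 = ((u - centers) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

def competitive_update(v, centers, eta=0.1, leak=0.01):
    u = v / np.linalg.norm(v)
    winner = int(np.argmax(activations(v, centers, sigma)))
    rates = np.full(len(centers), eta * leak)   # losers: 1/100 of the rate
    rates[winner] = eta                          # winner: full rate
    return centers + rates[:, None] * (u - centers)

v = rng.uniform(0, 1, size=60)
new_centers = competitive_update(v, centers)
```

Every center moves toward the normalized input, but only the winner moves appreciably, which is what keeps the units specialized to different regions of the input space.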
99 of the 100 cases and tested on the remaining one. This procedure enabled us to determine the generalization performance for each case. Each of the 100 networks was initialized in the normal way: weights were initialized to small random values between −0.5 and 0.5 and trained with learning rate (η) = 0.05 and momentum (α) = 0.9. Training was terminated based on cross-validation testing on the remaining case. An optimal level of training was observed for all network types. This was found to be around 40 epochs for the backpropagation and backpropagation-with-smoothing networks, 100 epochs for the delta rule networks, and 300 epochs for the RBF networks (supervised training portion). Further training always led to overtraining and poorer generalization. The RBF competitive layer was trained for 20 epochs at a learning rate (η) of 0.1, and then for 80 epochs at 0.01. Momentum (α) was not used for competitive learning.

2.2 Results. Performance was examined as a function of the number of hidden (or RBF) units (Fig. 2). For all networks, an optimal number of units was found. For the RBF and the BP networks, three hidden units per input group was optimal (Fig. 3). For the BP-with-smoothing networks, five was optimal.⁷ To compare the performance more directly to the target consensus scores, we found decision thresholds that best separated normal and mild defects from moderate and severe, as measured by percent agreement with the consensus scores. In all cases (except for the single-layered network), this decision threshold value turned out to be 0.50 (±0.01).⁸ Based on these decision thresholds, the performances of the networks in separating normal and mild defects from moderate and severe defects were computed (Fig. 4). The best performance was attained by the RBF network. This network agreed with the target consensus scores on 94.8% (853/900) of the regions, with sensitivity 57.0% (47/82) and specificity 98.5% (806/818).
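The leave-one-out protocol can be sketched generically. The tiny dataset and the trivial nearest-class-mean "network" below are stand-ins, purely to show the train-on-all-but-one, test-on-the-held-out-case loop:

```python
import numpy as np

# Leave-one-out ("jackknife") evaluation: train on all cases but one, test
# on the held-out case, and repeat once per case.  The nearest-class-mean
# classifier is a stand-in for the real networks; the data are made up.
def nearest_mean_predict(X_train, y_train, x):
    means = {c: X_train[y_train == c].mean(axis=0) for c in np.unique(y_train)}
    return min(means, key=lambda c: np.linalg.norm(x - means[c]))

def leave_one_out(X, y):
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        hits += nearest_mean_predict(X[mask], y[mask], X[i]) == y[i]
    return hits / len(X)

X = np.array([[0.0], [0.1], [0.9], [1.0]])
y = np.array([0, 0, 1, 1])
acc = leave_one_out(X, y)
```

The yield is one generalization verdict per case, which is what makes the per-case analysis in the Results section possible despite only 100 cases.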
Differences among the networks were tested using the nonparametric Friedman test with further evaluation using the two-tailed Wilcoxon signed-rank test for significance (p < 0.05). Analysis was based on mean squared differences between outputs and targets (control). Network type was a significant factor [χ²(3) = 11.244, p < 0.025]. The single-layer network was significantly inferior to all the other networks (p < 0.001). However, no significant differences were found among the RBF, BP, and BP-with-smoothing networks. The BP-with-smoothing network was more accurate than the BP network, although this difference was not significant (p < 0.1).

⁷Although we will not pursue this here, the optimal number of hidden units for the smoothing networks seemed to interact somewhat with smoothing level β. With larger values of β, larger numbers of hidden units were found to be optimal.
⁸The threshold for the single-layered network was found to be 0.78.
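This two-stage comparison (omnibus Friedman test, then pairwise Wilcoxon follow-ups) is available directly in scipy.stats. The per-case error scores below are invented for illustration; only the testing procedure mirrors the text.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

# Compare four network types on the same cases: an omnibus Friedman test
# on the related samples, followed by a pairwise two-tailed Wilcoxon
# signed-rank test.  The per-case MSEs below are made up for illustration.
rng = np.random.default_rng(2)
base = rng.uniform(0.01, 0.02, size=20)           # 20 shared cases
lms = base + 0.006                                 # clearly worse single layer
bp = base + rng.normal(0, 1e-4, size=20)
smoothed = base + rng.normal(0, 1e-4, size=20)
rbf = base + rng.normal(0, 1e-4, size=20)

stat, p = friedmanchisquare(lms, bp, smoothed, rbf)
w_stat, w_p = wilcoxon(lms, rbf)                   # pairwise follow-up
```

The Friedman test only ranks the methods within each case, so it is insensitive to the per-case difficulty captured in `base`, which is why it suits matched designs like this one.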
[Figure 2 plot: "Comparison of 4 Network Types" — average MSE vs. number of hidden units (2 to 22) for LMS, BP, smoothed BP, and RBF networks.]
Figure 2: Average mean squared error (MSE) of the four network types as a function of the number of hidden units.
Figure 3: ROC curve (sensitivity vs. specificity) for the RBF network with three hidden units.
Figure 4: Percent agreement (by region) with the target consensus scores of the best network of each type. There were nine regions for each of 100 cases, or 900 regions in total. (*) Indicates a statistically significant difference, based on MSE, at the p < 0.05 level.
2.2.1 Comparison with an Individual Expert. A randomly selected set of 52 cases was reinterpreted by one of the participating physicians. Scoring was performed without regard to the original scores. Nine of the cases were previously rated by this physician, either individually or in consensus. In normal-or-mild versus moderate-or-severe defect discrimination, this physician's scores agreed with the consensus scores on 97.6% (457/468) of the regions [sens: 89.5% (34/38), spec: 98.4% (423/430)]. On the same set of cases, the RBF network's scores agreed with the consensus scores on 95.9% (449/468) of the regions [sens: 63.2% (24/38), spec: 98.8% (425/430)], and agreed with the physician's scores for 94.9% (444/468) of the regions [sens: 56.1% (23/41), spec: 98.6% (421/427)]. These differences were not found to be significant in a paired sign test.

3 Conclusions and Future Directions
A network with one hidden layer of RBF units was compared with networks trained with backpropagation, backpropagation with weight smoothing, and single-layer networks trained with the delta rule. The best performance (94.8% generalization) was attained by an RBF network
with three hidden units per view (nine in total). This level of performance did not differ significantly from that of an expert. Performance of the single-layer delta-rule network was significantly poorer than that of the other networks. The BP with weight smoothing yielded a moderate, though not statistically significant, improvement over standard BP. Improvements and extensions include the following:
- Standardize visual scores. The data currently are a mixture of single-expert and consensus scores. Given the likelihood of individual differences in interpretation (Franken and Berbaum 1991; Cuarón et al. 1980), improved performance should be attained if single-expert or standardized group data are used.

- Elicit confidence ratings. Expert visual interpretations could be augmented by degree-of-confidence ratings. Highly ambiguous cases could be reduced in importance or eliminated. The ratings could also be used as additional targets for the network;⁹ cases indicated by the network with low levels of confidence would require closer inspection by a physician.

- Provide additional information. We have not yet incorporated the delayed distribution profiles, clinical history, gender, and examination EKG. The inclusion of these variables should allow the network to approximate more closely a complete diagnosis, and boost the utility of the network in the clinical setting.

- Add constraints. Currently we do not utilize the angles that relate the three views. It may be possible to build these angles in as constraints and thereby cut down on the number of free network parameters.

- Expand application. Besides planar thallium, our approach may also be applied to nonplanar 3-D imaging technologies and other nuclear agents or stress-inducing modalities. Preliminary results are promising in this regard.

We believe that our study is the first to successfully apply neural networks to the prediction of the human expert's visual interpretation of myocardial scintigraphy. Our network can be used today as a standard reference tool to help the physician reduce response bias and other types of scoring variability and may, with the improvements suggested, eventually equal or surpass the interpretive abilities of the physician.
Acknowledgments

The authors wish to thank Prof. Benny Shanon for much kind assistance and support throughout this research project, Mr. Haim Karger for technical assistance, and the Departments of Computer Science and Psychology at the Hebrew University for computational support. We would also like to thank Drs. David Shechter, Moshe Bocher, Roland Chisin, and the staff of the Department of Medical Biophysics and Nuclear Medicine for their help, both large and small. This paper has also benefitted from the comments of two anonymous reviewers. Terry Sejnowski suggested our usage of RBF units. Charles Rosenberg was supported by grants from the Ministry of Science and Technology, Israel and from the Golda Meir Fund.

⁹See Tesauro and Sejnowski (1988) for a related idea.
References

Areeda, J., Van Train, K., Garcia, E. V., Maddahi, J., Rozanski, A., Waxman, A., and Berman, D. 1982. Improved analysis of segmental thallium-201 myocardial scintigrams: Quantitation of distribution, washout, and redistribution. In Digital Imaging, P. D. Esser, ed. Society of Nuclear Medicine, New York.
Beller, G. A. 1991. Myocardial perfusion imaging with thallium-201. In Cardiac Imaging, M. L. Marcus, H. R. Schelbert, D. J. Skorton, and G. L. Wolf, eds. W. B. Saunders, Philadelphia.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Syst. 2, 321-355.
Cianflone, D., Carandente, O., Fragasso, G., Margonato, A., Meloni, C., Rossetti, E., Gerundini, P., and Chierchia, S. L. 1990. A neural network based model of predicting the probability of coronary lesion from myocardial perfusion SPECT data. In Proceedings of the 37th Annual Meeting of the Society of Nuclear Medicine, p. 797, May.
Cios, K. J., Goodenday, L. S., Merhi, M., and Langenderfer, R. 1989. Neural networks in detection of coronary artery disease. Computers in Cardiology Conference, September, 33-37.
Cuarón, A., Acero, A., Cárdenas, M., Huerta, D., Rodríguez, A., and de Garay, R. 1980. Interobserver variability in the interpretation of myocardial images with Tc-99m-labeled diphosphonate and pyrophosphate. J. Nucl. Med. 21(1), 1-9.
Dorffner, G., Prem, E., Mackinger, M., Kundrat, S., Petta, P., Porenta, G., and Sochor, H. 1992. Experiences with neural networks as a diagnostic tool in medical image processing. Unpublished manuscript.
Francisco, D. A., Collins, S. M., Go, R. T., Ehrhardt, J. C., Van Kirk, O. C., and Marcus, M. L. 1982. Tomographic thallium-201 myocardial perfusion scintigrams after maximal coronary artery vasodilation with intravenous dipyridamole: Comparison of qualitative and quantitative approaches. Circulation 66(2), 370-379.
Franken Jr., E. A., and Berbaum, K. S. 1991. Perceptual aspects of cardiac imaging. In Cardiac Imaging, M. L. Marcus, H. R. Schelbert, D. J. Skorton, and G. L. Wolf, eds. W. B. Saunders, Philadelphia.
Garcia, E. V. 1991. Physics and instrumentation of radionuclide imaging. In Cardiac Imaging, M. L. Marcus, H. R. Schelbert, D. J. Skorton, and G. L. Wolf, eds. W. B. Saunders, Philadelphia.
Garcia, E. V., Maddahi, J., Berman, D. S., and Waxman, A. 1981. Space-time quantitation of thallium-201 myocardial scintigraphy. J. Nucl. Med. 22, 309-317.
Lang, K. J., and Hinton, G. E. 1990. Dimensionality reduction and prior knowledge in E-set recognition. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 178-185. Morgan Kaufmann, San Mateo, CA.
Lee, S. C. 1990. Using a translation-invariant neural network to diagnose heart arrhythmia. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed., pp. 240-247. Morgan Kaufmann, San Mateo, CA.
Maddahi, J., Garcia, E. V., Berman, D. S., Waxman, A., Swan, H. J. C., and Forrester, J. 1981. Improved noninvasive assessment of coronary artery disease by quantitative analysis of regional stress myocardial distribution and washout of thallium-201. Circulation 64, 924-935.
Moody, J., and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Poggio, T., and Girosi, F. 1990. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247, 978-982.
Pohost, G. M., and Henzlova, M. J. 1990. The value of thallium-201 imaging. N. Engl. J. Med. 323(3), 190-192.
Porenta, G., Dorffner, G., Schedlmayer, J., and Sochor, H. 1988. Parallel distributed processing as a decision support approach in the analysis of thallium-201 scintigrams. Computers in Cardiology Conference, September.
Porenta, G., Kundrat, S., Dorffner, G., Petta, P., Duit, J., and Sochor, H. 1990. Computer based image interpretations of thallium-201 scintigrams: Assessment of coronary artery disease using the parallel distributed processing approach. In Proceedings of the 37th Annual Meeting of the Society of Nuclear Medicine, p. 825, May.
Rumelhart, D. E., and Zipser, D. 1986. Feature discovery by competitive learning. In Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, eds., Vol. 1, Chapter 5, pp. 151-193. MIT Press, Cambridge, MA.
Tesauro, G., and Sejnowski, T. J. 1988. A parallel network that learns to play backgammon. Tech. Rep. CCSR-88-2, University of Illinois at Urbana-Champaign Center for Complex Systems Research, February.
Widrow, B., and Hoff, M. E. 1960. Adaptive switching circuits. In 1960 IRE WESCON Convention Record, Vol. 4, pp. 96-104. IRE, New York.

Received 13 January 1992; accepted 28 September 1992.
REVIEW
Communicated by Françoise Fogelman
The Use of Neural Networks in High-Energy Physics

Bruce Denby
Fermi National Accelerator Laboratory, M.S. 318, Batavia, IL 60520 USA

In the past few years a wide variety of applications of neural networks to pattern recognition in experimental high-energy physics has appeared. The neural network solutions are in general of high quality, and, in a number of cases, are superior to those obtained using "traditional" methods. But neural networks are of particular interest in high-energy physics for another reason as well: much of the pattern recognition must be performed online, that is, in a few microseconds or less. The inherent parallelism of neural network algorithms, and the ability to implement them as very fast hardware devices, may make them an ideal technology for this application.

1 Introduction
High-energy physics (HEP) is the field that studies the basic constituents of matter and the fundamental forces through which they interact. Recently, high-energy physicists have become interested in neural networks as HEP data analysis tools. It has been only a few years since the first investigations of neural networks for HEP were undertaken (Denby 1988; Peterson 1989), and much of today's work is still exploratory; however, the growth in applications to HEP is quite striking. At the Second International AIHEP Workshop (AIHEP 1992) at La Londe-les-Maures, France, in January 1992, 25 applications of neural networks in high-energy physics were presented. For comparison, at the first workshop in this series, in Lyon, France in March 1990, there were only two such presentations. In applications to date, neural networks have proven themselves to be more efficient classifiers than the simple cuts normally used in HEP, have allowed certain measurements to be made with smaller uncertainties due to their superior ability at function approximation, and have permitted analyses to be made even from heavily overlapping distributions due to their good approximation to Bayes probabilities. There have been some extremely interesting results using hardware neural networks: it appears possible to make rather sophisticated pattern analyses directly in the readout hardware of HEP experiments rather than in the standard, time-consuming offline analysis.

Neural Computation 5, 505-549 (1993) © 1993 Massachusetts Institute of Technology
506
Bruce Denby
Figure 1: An accelerator with six interaction regions. Particles in bunches circulate in opposite directions, being brought together for collisions within the interaction regions. Normally only one or two particles within the bunches will actually collide during the crossing of two bunches. The Fermilab Tevatron has a diameter of about 1 mile. The SSC to be built in Texas will be about seventeen times larger. 1.1 HEP Accelerators. HEP data are produced in experiments at the large accelerator centers worldwide as detailed in Table 1. Each site features a "ring" in which opposing beams of particles are made to collide at one or more "interaction regions" (Fig. 1).¹ In the collisions, daughter particles of many kinds are produced, and these are detected in arrays of particle detectors surrounding the interaction region (see Fig. 2). The data from these detectors constitute the HEP data sets from which physics results must be extracted. As more powerful particle accelerators are built, the accompanying experiments grow tremendously, both in physical size and in the demands they place upon their data readout systems. Figure 2, detailing
Figure 2: Facing page. Elevation view of the CDF experiment at the Fermilab Tevatron. Only half of the apparatus is shown; it is symmetric about the point marked "interaction point." In the text, applications of neural networks to track reconstruction in a central tracking chamber, vertex finding in a vertex chamber, electron finding in an endplug calorimeter, and muon identification in a muon chamber are presented. ¹There are also experiments in which the extracted beam is directed onto a fixed target; for simplicity we shall not discuss these here.
Table 1: Names and Locations of the Major World Accelerator Centers, with the Type and Energy of Beam Used, Time Between Collisions of Particle Bunches, First Date of Operation, and the Names of the Major Experiments at the Site.*

Lab       Accel.    Location             Beams    Energy          Period    Startup  Major Experiments
Fermilab  Tevatron  Batavia, IL          p, p̄     0.9 x 0.9 TeV   4 μsec    1986     CDF, D0
CERN      LEP       Geneva, Switzerland  e+, e-   50 x 50 GeV     26 μsec   1988     Delphi, Aleph, Opal, L3
DESY      Hera      Hamburg, Germany     e-, p    27 x 820 GeV    96 nsec   1992     H1, Zeus
SLAC      SLC       Stanford, CA         e+, e-   50 x 50 GeV     7 msec    1988     SLD, Mark II
KEK       Tristan   Tsukuba, Japan       e+, e-   30 x 30 GeV     5 μsec    1986     Amy, Topaz, Venus
SSC       SSC       Ellis Cty., TX       p, p     20 x 20 TeV     16 nsec   1999     SDC, GEM
CERN      LHC       Geneva, Switzerland  p, p     10 x 10 TeV     16 nsec   1999     Under discussion

*e- stands for electron, e+ for positron, p for proton, and p̄ for antiproton. These particles are discussed in more detail in the following section. The unit of energy is the giga- or teraelectron volt (GeV or TeV), and time is measured in microseconds (μsec) or nanoseconds (nsec). The LHC and SSC are two large new machines scheduled to turn on before the end of the decade.
the CDF [Collider Detector at Fermilab (CDF 1988)] experiment at Fermilab, gives an idea of the scale and complexity of the detectors used in a current experiment. Detectors at LHC and SSC will be larger again by a factor of two or so. The volume of data produced in these detectors and the rate at which it must be analyzed are daunting. A typical experiment may record hundreds of thousands of individual detector channels, corresponding to about 1 million bits of information, for each collision, or "event," as it is usually called, and it is not uncommon to record many millions of events during a data taking run. The particles within a beam are stored in "bunches." The rate of collisions varies considerably from machine to machine, and is determined by the spacing between the bunches stored in the machine, since typically only one or two particles will actually collide in each bunch crossing. In all cases, the rates are rather challenging from the standpoint of real-time processing: at the Tevatron, bunch crossings currently occur every 4 μsec; at the SSC and LHC, they will occur about every 16 nsec. The growth in data set size and complexity and the unprecedented data rates at today's and future colliders have been the major motivating factors in the search for more powerful data analysis tools for HEP. In the discussion of HEP neural network applications that will follow, it will be necessary to have some familiarity with the terminology associated with high-energy particle collisions and the detectors that record them. Sections 1.2 and 1.3 provide an introduction.
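To get a feel for the scale, the figures quoted above (about 1 million bits per event, many millions of events per run) imply terabyte-class data sets. A minimal back-of-envelope sketch; the 10 million events per run is an assumed illustrative value, not a figure from the text:

```python
# Rough data volume from the figures quoted in the text:
# ~1 million bits per event, "many millions" of events per run.
bits_per_event = 1_000_000
events_per_run = 10_000_000   # assumed for illustration

total_bytes = bits_per_event * events_per_run / 8
print(total_bytes / 1e12, "TB")   # 1.25 TB for this assumed run
```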
1.2 Areas of HEP Research - the Standard Model. Much of current research in HEP is involved with the completion and verification of the so-called "Standard Model" of particle physics. In this model, the basic constituents of matter are quarks and leptons as described in Table 2. The constituents interact with each other by "exchanging"² particles called "bosons," as described in Table 3. Both quarks and leptons can interact via the electroweak force, carried by the W, Z, and γ bosons. This force combines the electric force, responsible for such phenomena as electricity and magnetism, with the weak force that is responsible for radioactivity. Quarks can also interact through the strong force, which is carried by bosons called gluons, usually represented as g. Individual quarks and gluons are not observable. The naturally occurring particles are either single leptons, or a "composite" of two or more quarks as in Table 4. A proton, for example, is composed of two "u" quarks and a "d" quark, which are bound together by exchanging gluons. Composites containing quarks are also referred to as "hadrons." Leptons and hadrons interact differently in matter, as described in Section 1.3.

²The word "exchanging" is used figuratively. The true interaction is a quantum process that defies classical explanation.
Table 2: Quarks and Leptons and Their Properties, Including Mass and Electric Charge.*

Type             Name                Symbol    Charge       Mass      Comments
Light quarks     Up                  u, ū      +2/3, -2/3   100 MeV   Ordinary matter composed of up and down quarks
                 Down                d, d̄      -1/3, +1/3   100 MeV
Heavy quarks     Strange             s, s̄      -1/3, +1/3   500 MeV   Strange matter exists in stars
                 Charm               c, c̄      +2/3, -2/3   1.5 GeV   Discovered in 1973
                 Bottom              b, b̄      -1/3, +1/3   5 GeV     Current area of study
                 Top                 t, t̄      +2/3, -2/3   130 GeV?  Not yet seen; much sought
Charged leptons  Positron, electron  e+, e-    +1, -1       511 keV   Causes chemical bonds
                 Muon                μ+, μ-    +1, -1       106 MeV   Exist naturally in cosmic rays
                 Tau                 τ+, τ-    +1, -1       1.8 GeV   First heavy lepton discovered
Neutral leptons  Electron neutrino   νe, ν̄e    0            0?        Not visible in detectors, except as "missing" energy; masses thought to be zero
                 Mu neutrino         νμ, ν̄μ    0            0?
                 Tau neutrino        ντ, ν̄τ    0            0?

*The equivalence of matter and energy allows us to write masses in energy units of eV.
Table 3: The Force Carrying Bosons and Their Properties.*

Force            Symbol  Name    Charge   Mass     Comments
Weak             W+, W-  W       +1, -1   81 GeV   Discovered at CERN in 1983; carriers of weak force
Weak             Z       Z       0        91 GeV
Electromagnetic  γ       Photon  0        0        Light is composed of photons
Strong           g       Gluon   0        0        Binds quarks in composites

*The first column tells the type of interaction the boson mediates: weak, electromagnetic, or strong.
Table 4: The Composites Most Commonly Encountered in HEP Detector Systems.

Type     Symbol  Quark content  Charge     Mass     Comments
Proton   p       uud            +1         939 MeV  Atomic nuclei made of protons and neutrons
Neutron  n       udd            0          940 MeV
Pion     π       ud̄, uū + dd̄    +1, -1, 0  135 MeV  Most commonly produced composites
Kaon     K       us̄, ds̄         +1, -1, 0  500 MeV
A collision between particles is, in the Standard Model theory, an interaction between two of the elementary constituents that they contain. For example, in a collision between a proton and an antiproton, the "true" collision may be between a quark and an antiquark, between a quark or antiquark and a gluon, or between two gluons. When physicists examine the debris of such a collision, they are seeking information on the constituents and force carriers that are produced in the collision. Quarks and gluons emerging from a collision are not directly observable in the detector; they are said to "fragment" into "jets" containing many particles as they emerge from a collision. Jets from quarks and from gluons are slightly different in their properties, as will be discussed in Section 4.2. The most "fashionable" areas of research in HEP today are the study of the production and decay properties of the "heavy" (i.e., massive) quarks, c and b, the search for the heaviest quark, called "top," or simply, "t," which is postulated but as yet undiscovered, studies of the bosons W and Z, the search for the Higgs particle (Table 5), an essential but as
Table 5: The Higgs Particle.

Symbol  Name   Charge  Mass  Comments
H⁰      Higgs  0       ?     Essential to theory; not yet seen; gives mass to particles
yet unobserved element of the Standard Model believed to be the origin of the masses of all particles, and the study of the characteristics of jets. 1.3 HEP Measurement Tools. Although there are quite a number of different types of measurement tools used in high-energy physics, most can be classified as one of two main types, tracking chambers and calorimeters. Figure 3 shows a generic HEP detector system with a central tracking chamber and a vertex tracking chamber, calorimeter with sections called "electromagnetic" and "hadronic," muon shielding iron, followed by another set of tracking chambers called muon chambers. The figure illustrates the behavior of the detectors for the four most commonly encountered types of particles and for a jet. Tracking chambers are used to detect the trajectories of electrically charged particles emerging from a collision. Usually the tracking chamber volume is within a magnetic field. This causes the path of the charged particle to curve, enabling a measurement of the momentum³ of the particle. A knowledge of the momenta of all charged particles allows a complete study of the underlying dynamics of the collision to be made. When a charged particle passes through the chamber, gas molecules along its trajectory are ionized (there are also tracking chambers which do not use gas as an active medium, but we shall not discuss them here). High-voltage wires spaced regularly throughout the tracking volume collect this ionization in the form of electrical pulses, which can then be passed on to the data acquisition system for analysis and reconstruction of the tracks. Position resolution finer than the wire spacing is obtained by using an electronic device to measure the time it takes for the ionization to drift to the wire. This is referred to as the "drift time." In Figure 3 only wires closest to the trajectories, called "hit" wires, are shown.
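The drift-time refinement described above amounts to a simple linear correction to the wire position. A minimal sketch; the wire position, drift velocity, and drift time are invented illustrative values, and the left/right ambiguity is resolved by an assumed `side` argument:

```python
# Sketch of drift-time position refinement in a tracking chamber.
# All numerical values here are illustrative assumptions, not the
# parameters of any particular detector.

def hit_position(wire_x_cm, drift_time_ns, drift_velocity_cm_per_ns=5e-3, side=+1):
    """Refine a hit position beyond the wire spacing using the drift time.

    The ionization drifts to the wire at roughly constant velocity, so the
    track's distance from the wire is velocity * time.  `side` (+1 or -1)
    resolves the left/right ambiguity about the wire.
    """
    return wire_x_cm + side * drift_velocity_cm_per_ns * drift_time_ns

# A track passing 0.4 cm to the right of a wire at x = 10 cm gives a
# drift time of 0.4 / 5e-3 = 80 ns.
print(hit_position(10.0, 80.0, side=+1))   # 10.4
```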
There are many different types of calorimeters but all have the same basic principle of operation. Calorimeters are normally built from many layers of metal interleaved with layers of a plastic or gas active medium. Quite the opposite of the tracking chambers, through which the particles pass uninterrupted, a calorimeter is designed to cause most particles incident on it to interact and deposit all of their energy within its volume. The energy may be in the form of ionization or of light, but will

³The momentum p of a particle is defined as p = Ev/c², where E is its energy, v is its velocity, and c is the speed of light. For nonrelativistic particles E ≈ mc², where m is the mass, giving p ≈ mv.
ultimately be converted into an electrical impulse with a size proportional to the energy of the particle. Most calorimeters have two sections of different composition, called "electromagnetic" and "hadronic." The electromagnetic section is designed to absorb almost all of the energy of the electromagnetically interacting particles, i.e., electrons and photons, while hadrons will deposit the largest fraction of their energy in the hadronic section. Calorimeters are usually highly segmented in order to give information on the spatial extent of the energy deposit from the particle, as shown in Figure 3, where the energy in each cell is represented by the height of the tower drawn at each cell. Note that the segmentation in the electromagnetic section is twice as fine as in the hadronic section. Calorimeters are particularly useful for identification of electrons. An electron will deposit almost all of its energy in a highly localized region
Figure 3: Behavior of a muon, electron, pion, neutrino, and jet in a HEP detector system. The beam pipe is perpendicular to the plane of the page. The muon passes completely through the calorimeters, depositing only a small amount of energy in each section, and through the shielding iron, to be finally detected in the muon tracking chambers. The electron deposits all of its energy in a localized region of the electromagnetic calorimeter. The pion deposits its energy over a region of both electromagnetic and hadronic calorimeters. The jet is composed of many particles of different types, mostly pions, and deposits energy both in electromagnetic and hadronic sections of the calorimeter over a broad region. The neutrino does not interact at all and passes undetected through the apparatus.
of the electromagnetic calorimeter. By looking for a charged track that points at this localized region, and matching the calorimeter energy to the track momentum, an electron can be reliably identified. Muons are charged particles that are capable of penetrating through great thicknesses of material with only minimal energy loss. For this reason, special muon tracking chambers are placed outside the calorimeter and a thickness of uninstrumented shielding iron in order to detect possible tracks from muons produced in a collision. The energy of other types of particles will be completely absorbed in the calorimeters and the shielding iron. The muon can be identified by measuring its momentum in the central tracking chamber and seeing if its projection through the calorimeter and iron matches well with a track "stub" found in the muon chambers. The detectors' response to pions, neutrinos, and jets is described in the caption of Figure 3.

1.4 Pattern Recognition in HEP - Standard Methods
1.4.1 Introduction. The only particles that are directly observable are those that have a natural lifetime long enough to allow them to be detected in the apparatus, that is, photons, muons, electrons, and some of the low mass composites such as pions and kaons. Neutrinos normally leave no trace in the apparatus and are detectable only by their "missing" energy. Most of the constituents produced in a collision quickly decay into these observable particles, or, in the case of quarks and gluons, fragment into jets containing many particles. The properties of the constituents must therefore be inferred from patterns in the "visible" particles into which they decay or fragment. Reconstructing an event involves two types of pattern recognition. The first, which we shall call low level pattern recognition, consists of such things as finding tracks in the tracking chambers or identifying a candidate electron in the calorimeter (Fig. 3). The second type, which we shall call physics process determination, uses more sophisticated features, for example, the angular distribution of the jets in the event, to try to identify the underlying physics of the interaction that took place. Note that this nomenclature is not the same as typically found in classical pattern recognition, since classification, normally considered "high-level," can occur both in our low-level and high-level pattern recognition.⁴ In HEP, the distinction between high-level and low-level pattern recognition is based on the complexity of the features used to perform the classification. Examples of the two types will be given in the sections to follow. We shall see that neural networks have found application to both.

⁴Segmentation of the data into events is performed trivially using timing information that correlates a block of data with the time of a particular bunch crossing.
In HEP it is also necessary to distinguish whether the pattern recognition is to be performed "on-line," that is, in real time, or "off-line." On-line pattern recognition is performed on the data before it is logged, in a part of the experiment referred to as the "trigger." Off-line pattern recognition is done with conventional computers operating on the data after it has been logged to permanent storage media. These two areas will be discussed in more detail below.

1.4.2 Triggering. New HEP experiments study increasingly rare physical processes. The implications of this for data acquisition systems are best illustrated by an example. One of the main motivations for the construction of LHC and SSC is the search for the Higgs particle. The probability of producing a Higgs particle when two protons interact is so small that such interactions would have to occur 10⁹ times per second⁵ in order to produce a reasonable sample of detectable Higgs particles, say 1000, during a 1-year run. The probability for other processes however, not involving the Higgs, is higher by a factor of about 10¹³. This implies that, during this 1-year run, events containing background processes will be continuously produced at a rate of about 1 billion per second. It is neither desirable nor feasible to log all of these events to permanent storage media such as magnetic tape. On-line pattern recognition, called "triggering," is required to reject background events and retain the rare interesting events. Although LHC and SSC represent an extreme case in high-rate HEP data acquisition, the problems are common to all HEP experiments. Figure 4 shows a typical multilevel HEP trigger system. The data from the detectors pass into the trigger as a stream of events, each containing all the detector data produced in a single collision. Each level of trigger rejects most of the events it receives and passes the remainder on to the higher level triggers.
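The arithmetic behind the Higgs example above is worth making explicit: an interaction rate of about 10⁹ per second, suppressed by a factor of 10¹³, yields a few thousand Higgs events per year of running. A rough sketch; the number of effective running seconds per year is an assumed round value:

```python
# Order-of-magnitude check of the trigger problem described in the text.
interaction_rate = 1e9        # collisions per second (from the text)
seconds_per_year = 3.15e7     # assumed: one year of uninterrupted running
higgs_suppression = 1e13      # background-to-Higgs production ratio (from the text)

higgs_per_year = interaction_rate * seconds_per_year / higgs_suppression
print(round(higgs_per_year))  # a few thousand, consistent with the "say 1000" quoted
```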
Levels 1 and 2 are typically implemented as fast specialized analog or digital hardware, while level 3 is a "farm" of conventional processors. The processing times and event rates shown at each level are generic, but typical of those encountered at current proton-antiproton collider experiments such as the CDF experiment at Fermilab (CDF 1988); rates will be one to two orders of magnitude higher at LHC and SSC. In level 1, simple tests on global event information are performed, for example: (1) comparing to a threshold the summed transverse energy, Et = Σi Ei sin θi, where Ei is the energy in calorimeter cell i and θi is the angle with respect to the beam axis of a line from the collision point to the calorimeter cell, (2) looking for the presence of a charged track with transverse momentum, Pt = P sin θ, where P is the track mo-

⁵The 10⁹ per second is technically the accelerator "luminosity" required to produce the 1000 Higgs particles. Luminosity is defined as the square of the number of particles per bunch, times the number of bunches per beam, times the revolution frequency of the bunches within the ring, divided by the cross-sectional area of the beams.
Figure 4: Generic HEP multilevel trigger system.

mentum, above a threshold, and (3) looking for the presence of one or more track segments in the muon chambers. The first and second cuts eliminate "soft" interactions. Most interesting physics processes involve "hard" scatters of two constituents in the beam particles, which produce particles at large angles to the beam direction and thus deposit in the
calorimeter substantial energy transverse to the beam direction. "Soft," glancing collisions of beam particles are much more copiously produced than hard scatters, and most must be rejected. The third cut is useful since high Pt muons are produced in many of the interesting processes currently under study, but are produced only with low probability in background processes. Level 1 triggers have a typical processing time of about 1 μsec and reduce the rate due to backgrounds by about two orders of magnitude. In the level 2 trigger, somewhat more sophisticated tests can be done, for example: (1) looking for a match between a high-Pt track and an energy cluster in the electromagnetic calorimeter, indicating the presence of a candidate electron, or between a high-Pt track and a track segment in the muon counters, indicating a candidate muon, and (2) looking for the presence of localized clusters of energy in the calorimeter, which will correspond to jets, with Et above some threshold. Validating the presence of leptons and jets as in (1) and (2) above ensures that the event is more likely to have come from an interesting physics process. Ten to twenty microseconds are available for level 2 decisions. Level 3 triggers are executed using algorithms written in standard high level computer codes running on a "farm" of conventional processors that operate in parallel on separate events. As each event comes into level 3, it is immediately sent to an available processor. The processing done by level 3 can be quite sophisticated, in some cases being identical to the code used in offline analyses.
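The level-1 transverse-energy cut described earlier (Et = Σi Ei sin θi compared to a threshold) can be sketched in a few lines. The cell energies, angles, and threshold below are invented for illustration:

```python
import math

# Sketch of a level-1 style transverse-energy trigger:
# Et = sum over calorimeter cells of E_i * sin(theta_i), then a threshold cut.
# Cell contents and the 20 GeV threshold are illustrative assumptions.

def transverse_energy(cells):
    """cells: list of (energy_GeV, theta_rad), one entry per calorimeter cell."""
    return sum(e * math.sin(theta) for e, theta in cells)

def level1_et_trigger(cells, threshold_GeV=20.0):
    return transverse_energy(cells) > threshold_GeV

hard_scatter = [(30.0, math.pi / 2), (15.0, math.pi / 4)]  # hypothetical event
print(level1_et_trigger(hard_scatter))   # True: Et is about 40.6 GeV
```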
Some of the typical analyses performed in level 3 are (1) reconstruction of charged tracks, (2) accurate calculation of the position of the collision point in order to reject events too far from the detector center, to allow more accurate calculation of Et of calorimeter cells, and to detect multiple vertices, (3) high-quality electron and muon identification using accurate Pt measurements of the tracks, (4) imposition of isolation cuts, that is, requiring that an electron or muon has very little energy surrounding it in the calorimeter, and (5) formation of composite triggers, for example, electron plus missing transverse energy plus one or more jets would be a good trigger for top quark production. Such calculations as these are too complicated to be performed in level 2. The time to process a single event in level 3 may be of the order of a second, however as there are many processors operating in parallel, the effective processing time is a few milliseconds per event.

1.4.3 Offline Reconstruction. Offline reconstruction is the final event reconstruction in which all available information is processed using whatever data analysis techniques may be available. Normally all the data from a run will be processed in a single reconstruction pass in which data sets of special interest are created, for example, one for the physics of b and c quarks, one for the search for the top quark, one for W and Z physics, etc. These are often analyzed many times over with ever more
refined sets of selection cuts. Analysis usually proceeds with the definition of several feature variables on which one-dimensional cuts are placed. The use of likelihood techniques is also common. The offline analysis does not have the same real-time constraint as online reconstruction; however, the codes used to process high-energy physics data are normally tens of thousands of lines long and require substantial computing resources in order to complete the processing in a reasonable amount of time. It is not uncommon for a complete offline reconstruction of a particular physics process to take 1 or 2 years.

2 The Need for Neural Networks

In high-energy physics, neural networks have been used both in real-time and offline applications. Most applications to date have used MLPs trained with backpropagation, although a few instances of the use of learning vector quantization (LVQ) and feature maps have also appeared. Recurrent networks have been applied to the problem of charged track reconstruction as discussed in Section 5.1. For the offline applications, the advantage to HEP is the same as that for other fields: near optimal classification with a minimum of computational overhead. In the real-time applications, neural networks present an advantage because of their parallel architecture, which allows for faster processing. We now discuss these two areas in more detail. 2.1 Neural Networks for Triggering. It is interesting to note that some of the functions performed by standard level 1 and 2 triggers as discussed above, that is, thresholding performed upon a linear combination of inputs, already resemble those performed by an artificial neuron. High-energy physicists building fast trigger electronics have for decades been making use of electronic devices called "discriminators" for performing this function. The idea of applying true neural network technology in HEP triggering, however, is quite new (Denby 1988; Denby et al.
1990), and it is far from being accepted as a standard tool. Research is currently under way to evaluate computing technologies for the trigger systems of experiments to take place at the new accelerators. At the SSC and LHC, the time between bunch crossings is so short, about 15 nsec, that no known technology can keep pace with the events as they come in. For this reason, a technique called pipelining is envisioned in most level 1 trigger schemes for SSC and LHC. The data from the events enter consecutively into a shift register. Data are clocked from one location in the shift register to the next at a time interval identical to the bunch crossing interval. At each location, part of the level 1 triggering algorithm is executed, so that by the time an event is ready to exit the pipeline, the entire level 1 algorithm will have been performed on
it. Thus, although an initial startup time is needed to "fill the pipeline," once it is full, a level 1 decision will be made every 15 nsec. It has not yet been decided how the trigger processing will be apportioned among the level 1, level 2, and level 3 triggers, nor have the exact technologies for the triggers been chosen. The construction of the new accelerators will take several years, and there is a strong tendency to avoid freezing technology choices too early. Some scenarios prefer a very sophisticated level 1 trigger which reduces the rate sufficiently to pass events directly to the level 3 processor farm (Farber et al. 1991; Crosetto and Love 1992), while others prefer a simpler level 1 coupled with high speed arithmetic processors in level 2, which reduce the requirements placed on the level 3 processor farm (SDC 1992). Algorithms making use of recurrency are not well amenable to pipelining and may require compromises to be implemented in level 1. Present day silicon neural networks, with settling times of the order of several hundred nanoseconds, are not appropriate for level 1 triggers; however, with advances in technology, much faster chips may be possible (see, e.g., Hansen 1992). But even currently available neural networks are sufficiently fast to give competitive performance for level 2 triggers. Technologies currently under study for the level 2 processors for the SSC are associative memories (Amendolia et al. 1990), neural networks (Denby et al. 1990), associative string processors (Lea 1988), and image processors (Bock et al. 1990). Although trigger systems using conventional electronics can probably be made to handle the rates to be found at SSC and LHC, neural networks can make the triggers far more efficient and less costly by moving to level 2 the complex pattern recognition normally done in level 3.
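The pipelining idea described above (one algorithm stage per shift-register slot, one decision per clock tick once the pipe is full) can be modeled in a few lines. The two stages here, an energy sum and a threshold, are invented illustrative stand-ins for real level-1 algorithm steps:

```python
# Toy model of a pipelined level-1 trigger: events are clocked through a
# shift register; the stage owning each slot runs on each tick, so after
# the fill latency one fully processed decision exits per tick.
# The stages and the 10 GeV threshold are illustrative assumptions.

def stage_sum(ev):
    ev["et"] = sum(ev["cells"])
    return ev

def stage_threshold(ev):
    ev["accept"] = ev["et"] > 10.0   # assumed threshold
    return ev

STAGES = [stage_sum, stage_threshold]

def run_pipeline(events):
    pipe = [None] * len(STAGES)
    decisions = []
    for ev in list(events) + [None] * len(STAGES):   # extra ticks flush the pipe
        done = pipe[-1]
        if done is not None:
            decisions.append(done["accept"])
        pipe = [ev] + pipe[:-1]                      # clock tick: shift right
        pipe = [STAGES[i](e) if e is not None else None
                for i, e in enumerate(pipe)]
    return decisions

evts = [{"cells": [6.0, 7.0]}, {"cells": [1.0, 2.0]}]
print(run_pipeline(evts))   # [True, False]
```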
In Section 3 we shall show some specific examples of this: accurate muon Pt measurement in a few microseconds, application of an isolation cut at level 2, a possible scheme for determining the position of the collision point online, etc. This will reduce the requirements placed on the level 3 processor farm and significantly reduce the amount of data that must be recorded on tape for later analysis. Another attractive feature of neural nets for triggering is their programmability. In the past, many level 2 triggers have been built as hardwired special purpose electronic devices. To change the algorithm in such a device implies rebuilding it or rewiring it. In a neural network, the algorithm can be changed simply by downloading a different set of weights, which will make neural network triggers much more flexible than their predecessors.

2.2 Offline Applications. Historically, high-energy physicists have eschewed "complicated" data analyses in favor of simple one-dimensional cuts. In HEP, such problems as incomplete understanding of detector response, and heavy dependence on Monte Carlo models render the extraction of a final physics result from the experimental data an
extremely difficult and time-consuming task, sometimes requiring hundreds of man years of effort. There was a strong tendency to try to keep the analyses as simple as possible. However, over the years in HEP, considerable experience in detector construction techniques and in software generation has been gained, and detector simulation packages that model instrumental effects have become extremely sophisticated. Moreover, with the growth of collaboration size, particular groups of researchers within an experiment have been able to devote themselves exclusively to data analysis problems. The key to the value of neural networks in offline HEP analyses is in creating efficient cuts to retain events from rare physics processes while rejecting as many as possible of the background events. A further advantage is that neural networks may make possible certain analyses which previously were considered hopeless precisely because simple one-dimensional cuts were known to be ineffective discriminators. An example of this is the classification of quark and gluon jets, which we shall discuss in Section 4. It has been argued that although a series of one-dimensional cuts is less efficient than a multidimensional cut, this can be compensated for by taking more data. As the interesting physics processes to study become more rare, however, this reliance on increased statistics becomes impossible: it becomes necessary to extract as much information as possible from the data at hand.
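The inefficiency of a series of one-dimensional cuts relative to a multidimensional cut can be seen with a toy example: independent cuts on x and y keep only a rectangle, while even a simple linear combination of the two variables can accept signal-like events that the rectangle rejects. All the cut values and the example event below are invented for illustration:

```python
# Toy comparison of independent one-dimensional cuts versus a single
# combined (here, linear) two-dimensional cut.  All numbers are invented.

def passes_1d_cuts(x, y, cx=1.0, cy=1.0):
    """A series of one-dimensional cuts keeps only the rectangle x>cx, y>cy."""
    return x > cx and y > cy

def passes_combined_cut(x, y, threshold=2.0):
    """A simple two-dimensional discriminant on the same variables."""
    return x + y > threshold

event = (0.5, 2.0)                  # fails the x cut, yet large in x + y
print(passes_1d_cuts(*event))       # False
print(passes_combined_cut(*event))  # True
```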
2.3 The Problem of Training Data. One of the major goals of HEP is to identify and characterize the properties of as yet unseen constituents in the standard model. This, however, presents a problem for classification schemes involving supervised learning since there are no existing real data containing these particles. It follows that Monte Carlo data must be generated according to some model. In some cases, there are a number of rather different models to choose from. Any classification based upon these models will therefore be biased towards the model chosen. This is of course a problem for any type of classifier; however, a number of high-energy physicists are concerned that it will be more difficult to understand model dependence using neural networks than using a simpler type of classifier. This is used as an argument against using neural networks in HEP analyses. Although it is true that model dependence in a nonlinear classifier is somewhat more difficult to characterize than in a linear classifier, the superior performance of nonlinear classifiers has led some researchers to expend the additional effort necessary to characterize the model dependence. This will be seen in some of the applications described in Section 4. This effect is particularly important in triggering. Events rejected by a trigger will not be recorded, and so can never be used to check what the trigger was doing. For this reason, there has been a tendency in the past to keep trigger cuts as simple as possible to facilitate understanding of
the trigger efficiency. This "validation" problem is not important for triggers based on low level pattern recognition such as track segment finding or electron identification since modern detector simulations can quite reliably simulate such simple entities as tracks and electrons. However, because of possible biases from model dependence, there is still work to be done in HEP to show convincingly that unbiased information can be extracted from data taken with triggers that select specific physics processes, whether they use neural networks or more conventional technology.

3 Applications to Low Level Pattern Recognition
These applications, as well as those in later sections, are summarized in Table 6.

3.1 Trigger Applications. We will treat in this section only those trigger applications which have already been realized or have been seriously proposed. Some of the other low level pattern recognition applications which follow are also intended for triggering but are still just studies.
3.1.1 First Real-Time Application: Muon Trigger. The first real-time application of a neural network in HEP was accomplished recently at Fermilab (Lindsey et al. 1992).

3.1.1.1 Conventional method. Identification of a muon with a transverse momentum Pt above a threshold is a useful trigger for detecting decays of Ws, Zs, and b quarks, since each will decay about 10% of the time to a muon. The cut on Pt is necessary since background processes produce many low Pt muons. A measurement of the Pt of a muon in the trigger requires a knowledge of the angle of the muon track at the muon chamber. Although, offline, the wire drift times can be used to calculate the track angle quite accurately, in the trigger only the information on which wires were hit is available, resulting in an inaccurate measurement of Pt in the trigger. It is therefore necessary to set the Pt trigger threshold quite low in order to avoid discarding high Pt events that have been poorly measured. This introduces a large amount of background.

3.1.1.2 Test beam results. In a simple test beam experiment at the Fermilab Tevatron, slopes and intercepts of muon tracks traversing a small prototype drift chamber were calculated accurately, in real time, using a commercial VLSI neural network chip incorporated into the standard drift chamber data acquisition system. This was a test experiment carried out in an auxiliary particle beam; in a full scale collider experiment, the drift chamber would be duplicated many times over to cover an area
522
Bruce Denby
Table 6: Summary of the main HEP neural network applications covered in this paper. Continued next page. Columns: problem; training set; test set; network; results and comments.

Low level triggering:
- Muon trigger test beam experiment, Fermilab. Training: Monte Carlo tracks. Test: real online data. Network: 15-64-64 MLP + ETANN. Results: fifty-fold improvement in position resolution over the conventional trigger; to be applied to the D0 expt. muon upgrade.
- Isolation and b trigger for the CDF calorimeter, Fermilab. Training: real and Monte Carlo data. Test: real and Monte Carlo data. Network: 50-4-1 and 50-10-1 (b) MLP + ETANN. Results: currently installed in the CDF experiment; one net is 100% efficient and triggers on real data; the other two are in tagging mode.
- Level 2 trigger for H1 expt. at HERA. Training: Monte Carlo data. Test: Monte Carlo data. Network: 19-??-1 MLP + silicon. Results: proposed; simulations show a 10-fold background rejection and 10 μsec execution time, suitable for level 2.
- Electron i.d. for LHC at CERN. Training: Monte Carlo data. Test: Monte Carlo data. Network: 192-96-1 MLP; 32-32-1 silicon prototype. Results: simulation studies show good rejection; the prototype chip had a propagation time of 15 nsec, suitable for LHC or SSC.

Low level, offline:
- Find primary vertex at E735 expt., Fermilab. Training: real collider data. Test: real collider data. Network: 18-128-62 MLP. Results: overlapping 18-wire sections summed; 3 times better resolution than TOF; finds multiple vertices naturally.
- Kink finding for charged particle tracks. Training: Monte Carlo data. Test: Monte Carlo data. Network: 14-7-1 and 14-14-1 (res.), 42-6-1. Results: both parameter and residual neural net methods exceed the performance of the standard chi-squared technique; the residual method gives a 20x speedup.

Physics process determination:
- Z decay probabilities into b, c, and (uds). Training: Monte Carlo data. Test: real data from Delphi expt. Network: 19-25-3 MLP, one output node for each of b, c, (uds). Results: Z decay probabilities into b, c, and (uds) measured more accurately than with the standard method.
Table 6: Continued.

Physics process determination (continued):
- Quark/gluon discrimination at CDF, Fermilab. Training: Monte Carlo data. Test: real data from CDF expt. Network: 8-6-1 MLP (feature map in new analysis). Results: heavy overlap of quark/gluon distributions; first evidence for quark fraction increase with Et.
- B tagging. Training: Monte Carlo data. Test: Monte Carlo and real data. Network: MLP; MLP + LVQ. Results: numerous references on b tagging, mostly at LEP.

Recurrent nets and track reconstruction:
- Track reconstruction with Denby-Peterson net. Training: none (hand wired). Test: real data from Aleph expt. Network: fully connected recurrent network. Results: neurons are links between hits; links form tracks as the system settles; tested on Aleph expt.
- Deformable templates / elastic arms. Training: none. Test: Monte Carlo data. Network: dynamic system. Results: inspired by the Denby-Peterson net and elastic net methods; not really a neural network.
of many square meters surrounding the other measuring devices, as in Figure 3. The drift chamber sense wire signals appeared on time-to-voltage converters (TVCs) that convert the drift time of the ionization to the wire into a voltage. The setup is shown in Figure 5. The beam dump in the figure simulates the shielding iron of Figure 3. The small circles in the drift chamber volume represent the wires, and the small horizontal lines above and below represent the TVC values interpreted as a drift distance. Note that there is an ambiguity as to which side of the wire the particle passed on; the neural net must resolve this ambiguity. The wires in Figure 5 are paired vertically. For each of the three pairs, two signals are produced: a drift time and a latch, indicating whether the lower or upper member of the pair was hit. The drift time signals had to be duplicated four times in order to achieve sufficient fanout for the analog neural net chip. These 12 signals were coupled with the three latch signals to form the 15 inputs to the neural network chip, configured as an MLP. Sixty-four hidden units in a single layer were used. The output layer consisted of 64 units divided into a group of 32 to encode slope and a group of 32 for intercept [this type of readout has been used in several previous studies of tracking with neural networks (Denby et al. 1990b; Lindsey and Denby 1991; Lindsey 1991)]. Each output unit covers 0.625 centimeters in intercept or 0.05 radians in slope. The network was trained
Bruce Denby
524
Figure 5: Setup for the drift chamber neural net trigger test. Trigger counters and the drift chamber feed trigger electronics and TVCs/ADCs on the readout motherboard; the ETANN board processes the signals, and the ETANN output is returned for digitization by the readout computer.
on 10,000 tracks generated with a simple Monte Carlo, using gradient backpropagation. Target patterns consisted of gaussian histograms with means equal to the target slope and intercept and rms width of one bin. Architectures with fewer hidden units were also tried, but these resulted in degraded performance. (In an analog hardware network such as this, extra hidden units may be needed simply to increase fanout.) The weights obtained were downloaded into an Intel Electronically Trainable Analog Neural Network (ETANN) chip after performing emulation and chip-in-the-loop training using the Intel ETANN Development System (Intel 1991). The intercept position resolution available using the conventional trigger technique, which does not make use of the drift times, is 5 cm. The neural network trigger was found to have a position resolution of 1.2 mm. This resolution is only about a factor of two worse than the best obtainable offline using the complete reconstruction algorithm, but is available in about 8 μsec. The neural network result, as shown in Figure 5, can be passed back to the readout motherboard for readout with the rest of the event information, without introducing dead time in the data acquisition system.

3.1.1.3 Future plans. The drift chamber used in the above tests was a prototype of chambers that are currently installed in the D0 experiment at Fermilab (D0 1983). A group on the D0 experiment is currently installing an ETANN chip on one of their chambers to take test data during the 1992 run (Haggerty 1992). They also plan to incorporate the ETANN
readout into the trigger of the upgraded D0 detector in the 1994 run of that experiment (Fortner 1992). This will allow a more accurate determination of the muon Pt, which will allow the threshold to be lowered and significantly reduce the amount of background data recorded.
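The gaussian-histogram output encoding used in the drift chamber trigger (a group of 32 units for slope and 32 for intercept, with targets that are gaussians of one-bin rms centered on the true value) can be sketched as follows. This is a minimal illustration; the bin ranges and the centroid decoding are assumptions, not details taken from the published setup.

```python
import math

def gaussian_target(value, lo, hi, n_bins=32, sigma_bins=1.0):
    """Encode a scalar as a gaussian bump over n_bins output units,
    centered on the true value with an rms width of one bin."""
    width = (hi - lo) / n_bins
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return [math.exp(-0.5 * ((c - value) / (sigma_bins * width)) ** 2)
            for c in centers]

def decode(activations, lo, hi):
    """Recover the encoded value as the activation-weighted mean of the
    bin centers (one plausible way to read the histogram back out)."""
    n_bins = len(activations)
    width = (hi - lo) / n_bins
    centers = [lo + (i + 0.5) * width for i in range(n_bins)]
    return sum(a * c for a, c in zip(activations, centers)) / sum(activations)

# Round trip for a hypothetical intercept of 3.1 cm on a -10..10 cm range:
target = gaussian_target(3.1, -10.0, 10.0)
assert abs(decode(target, -10.0, 10.0) - 3.1) < (20.0 / 32)  # within one bin
```

The smooth target gives the network partial credit for near misses, which is why the achievable resolution can be finer than the bin spacing.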
3.1.2 Test Case: The CDF Experiment. Neural network trigger hardware is installed for the current run of the CDF experiment. We describe below the conventional CDF calorimeter trigger and the neural network improvements to it.

3.1.2.1 Conventional techniques. The trigger for the CDF experiment at Fermilab has been in operation since the first experimental run in 1987 (CDF 1988). In this trigger, signals from the calorimeter cells appear as analog levels (i.e., voltages) at the ends of special 200-foot cables, where they are received by the trigger receiver boards. From this point on, the trigger can be thought of as operating on an array of voltages of size 24 (azimuthal angle) by 42 (pseudorapidity, related to polar angle) by 2 (electromagnetic/hadronic compartment), which represent the energies in the calorimeter. Analog processing is used in levels 1 and 2 for the cluster analysis, in which the total ET of the cluster, the number of towers in the cluster, and the cluster width are computed. Once the cluster analysis is finished, additional digital processing is performed, operating on the cluster quantities using the level 2 processors and special function modules, for example, to calculate the ratio of the cluster's energies in the electromagnetic and hadronic calorimeters.

3.1.2.2 CDF neural network triggers. The existing CDF calorimeter trigger is very powerful, but is based on the philosophy that clusters can be adequately described by their position, their width, the number of towers they contain, and the ratio of hadronic to electromagnetic energy they contain. Indeed, this information is adequate for a great many triggers. However, there are instances when a more sophisticated cluster analysis would be fruitful. A neural network trigger is currently installed at the CDF experiment (Wu et al. 1990; Denby et al. 1991; Badgett et al. 1992).
For every cluster found by the cluster finder, the new trigger selects a 5 by 5 trigger tower region of interest (in the hadronic and in the electromagnetic compartments) centered on the cluster and passes the 50 analog signals to analog neural network chips (Intel 1991). The chips are programmed to execute three different cluster algorithms: (1) determine if the cluster could be an isolated photon in the central calorimeter, (2) determine if the cluster could be an isolated electromagnetic shower in the endplug6 calorimeter, and (3) determine if the cluster could have come from the semileptonic decay of a b quark.7 None of these analyses would be possible using the existing calorimeter trigger without extensive hardware modifications.

6 The endplug is a name given to calorimeters or other detectors which fit into the end openings of the cylindrical central detectors (Fig. 2).

We choose the isolated endplug electron/photon trigger as a simple illustrative example. There is a very high rate of clusters in the endplug which pass the conventional electron trigger but are in fact due not to electrons but to background processes. In the past, a high energy threshold was used in the endplug in order to reduce the false positive rate. This, however, is undesirable since it rejects a significant number of real electrons and photons along with the background. Good electrons and photons are normally isolated in the calorimeter (i.e., have very little energy surrounding them). In 1992, an isolation requirement, implemented by a neural network, was tried in the level 2 trigger to allow the same trigger rate but with a lower energy threshold. Normally such a cut would have been made in the level 3 trigger; the conventional level 2 trigger cannot implement this cut since it no longer has access to the individual tower energies after cluster finding.

Figure 6: Isolation templates for the plug electron trigger.

The neural net endplug isolation trigger operates upon 5 by 5 tower regions of the electromagnetic and hadronic calorimeters, as shown in Figure 6 (only the electromagnetic part is shown in the figure). The dark central region is meant to contain the electromagnetic shower, which normally produces a narrow cluster in one or two towers. Four templates are necessary since some of the shower's energy may spill over into 2 to 4 towers, and since the cluster center as found by the cluster finder may not perfectly center it in the 5 by 5 array in all cases. Each template is represented as a hidden unit in the neural network, and each tower has a weight connecting it to one of these hidden units. Cells in the central region have a weight of F, and cells in the outer region have a weight of -1.
Thus, the quantity presented to each hidden unit, which is used as a comparator, is

F * (sum of central-region tower energies) - (sum of outer-region tower energies)
7 A semileptonic decay is one in which a quark decays to a lepton plus other particles. In a purely leptonic decay, the quark decays to a charged lepton and a neutrino.
If this quantity is negative, the hidden unit will not “fire”: the energy outside the central region was greater than some fixed fraction of the central region energy and the cluster is thus not isolated. If the quantity is positive, the neuron fires, indicating an isolated cluster. If any of the templates fires, the cluster is isolated, that is, the output unit simply sums up the outputs of the hidden units. The value F = 0.16 was found to be optimum in the present application. (Since the network is very simple, and essentially “hand wired,” it was not necessary to train the network using, e.g., backpropagation.) Using this value, in a simulation of the trigger operating on real data from a previous CDF run, it was possible to lower the energy threshold for endplug electrons and photons from 23 to 15 GeV, while reducing background by a factor of 4 and retaining 95% of the signal. The trigger has been checked out in the current CDF run and appears to function as designed; the efficiency and background rejection are still being evaluated. The isolated central photon trigger operates in an analogous way, except that it operates in the central region of the calorimeter rather than the endplug, and in this case has only one template with a single tower in the central region of the 5 by 5 grid. This trigger provides access to a class of physics events containing so called “direct” photons, which tend to be isolated in the calorimeter. The trigger has been measured to be 100 percent efficient, and without the isolation cut, the high rate of background would severely limit the amount of good data which could be taken. In the case of the semileptonic b trigger, a Monte Carlo program was used to generate events containing the semileptonic b jets and background events not containing b jets. The semileptonic b jets will contain an electron as well as other particles, while the background jets will not contain electrons. 
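The hand-wired isolation logic just described can be sketched as follows. The value F = 0.16 is from the text, but the exact shapes of the four central-region templates are illustrative guesses, as the paper does not enumerate them.

```python
F = 0.16  # optimum central-region weight quoted in the text

# Illustrative template shapes: the central shower may occupy one tower
# or spill into 2 to 4 neighboring towers of the 5x5 array.
TEMPLATES = [
    {(2, 2), (2, 3)},
    {(2, 2), (3, 2)},
    {(2, 2), (2, 3), (3, 2), (3, 3)},
    {(2, 1), (2, 2)},
]

def template_fires(roi, central):
    """One comparator hidden unit: weight F on central cells, -1 on the
    rest. It fires when the outer energy is below F times the central."""
    e_central = sum(roi[r][c] for (r, c) in central)
    e_outer = sum(roi[r][c] for r in range(5) for c in range(5)
                  if (r, c) not in central)
    return F * e_central - e_outer > 0

def is_isolated(roi):
    """Output unit: the cluster is isolated if any template fires."""
    return any(template_fires(roi, t) for t in TEMPLATES)

# A narrow 20 GeV cluster with little surrounding energy should pass:
roi = [[0.05] * 5 for _ in range(5)]
roi[2][2] = 20.0
assert is_isolated(roi)
```

Because the weights are fixed by inspection rather than learned, downloading them to the analog chip is all that is needed; no backpropagation training is involved.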
A full detector simulation was used in order to model as closely as possible any instrumental effects. A training set was made from 5 by 5 regions centered on the b jets extracted from the signal and background events. This was used to train a feedforward neural network with 50 inputs, one hidden layer of 10 units, and a single output unit to discriminate between b jets and non-b jets. This is the only one of the three CDF neural network triggers which uses a network trained with backpropagation; the other two are "hand wired" nets. A simulation of the trigger showed a reduction of background by a factor of about 100 while retaining 30% efficiency for bs. The weights found in the simulation will be loaded into the neural network chip in order to allow online identification of the b jets. The performance of this trigger on real data in the current CDF run is still being evaluated. It would be impossible to carry out a discrimination such as this using conventional computer hardware within the time limits of the level 2 trigger (i.e., about 20 μsec). The hardware for these triggers is installed. The central photon trigger is actually triggering the detector; the other two triggers are still "tagging" data taken on other triggers pending full checkout. All three of the triggers are implemented with identical hardware. It is remarkable that such different algorithms can be implemented with the same hardware simply by downloading different weights. Future modifications to any of the algorithms will also be easy because of the programmability of the neural net.

3.1.3 Other Trigger Applications.
3.1.3.1 The H1 experiment. The Hera accelerator, which collides electrons with protons, is just coming on line at the time of writing. The experiments H1 and Zeus there will study the momentum distribution of constituents within the proton and measure the coupling strength of the gluon to the different quarks. At Hera, the rate of produced events due to background processes, such as the interaction of a beam particle with a residual gas molecule in the vacuum system, is several orders of magnitude larger than the rate due to the physics processes of interest. In the H1 experiment, a 4-level trigger system is envisioned in order to reduce this high rate to a manageable level of about 100 Hz. Level 1 is a digital pipeline which reduces the rate by about a factor of 100. An additional reduction of a factor of 10 is required in level 2 in order to provide an acceptable rate into levels 3 and 4, which are implemented in software on conventional computers. The level 2 trigger must complete its processing within 20 microseconds. A hardware neural network has been proposed as a solution to this problem (Ribarics et al. 1991; Ribarics 1992a,b). We describe the approach below. In level 1, 16 simple trigger quantities, such as total summed energy, total summed transverse energy, and total energy in the central region of the calorimeter, are compared to thresholds. Level 1, however, ignores correlations among the input variables. More sophisticated cuts will be made in level 2 by augmenting the level 1 quantities with additional information which becomes available after the level 1 decision time and feeding the resulting list of variables to a feedforward neural network. At present 19 input variables, including energy sums in subsets of the calorimeter, information on the vertex position, the number of charged tracks, etc., are used.
The neural network will use these 19 variables to determine whether the energy patterns in the event have come from an electron/proton collision or from a beam-gas collision or other background. The detailed architecture of the neural network is still under development; however, typical results using Monte Carlo data with an MLP show retention of 98% of events from interesting physics processes and rejection of 90% of background events, that is, the required reduction factor of 10 is achieved while maintaining excellent efficiency. The algorithm is planned to be executed by a Siemens MA16 neural network chip (Ramacher et al. 1991), which should be able to finish processing in 10 μsec, well within the allocated time.
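The proposed level 2 decision amounts to a small feedforward pass over the 19 variables followed by a threshold. A minimal sketch follows; the hidden-layer size, the weights, and the 0.5 threshold are placeholders, since the text notes the architecture is still under development.

```python
import math
import random

def mlp_forward(x, w1, b1, w2, b2):
    """One sigmoid hidden layer and a single sigmoid output unit."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)

def level2_accept(x, weights, threshold=0.5):
    """Accept the event (pass it on to levels 3 and 4) if the net
    output exceeds the threshold; otherwise reject it as background."""
    return mlp_forward(x, *weights) > threshold

# Placeholder network: 19 inputs, 10 hidden units, random weights.
random.seed(0)
n_in, n_hidden = 19, 10
weights = (
    [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)],
    [0.0] * n_hidden,
    [random.uniform(-1, 1) for _ in range(n_hidden)],
    0.0,
)
event = [random.uniform(0.0, 1.0) for _ in range(n_in)]
decision = level2_accept(event, weights)
```

In the real system the forward pass runs on the MA16 chip; the point of the sketch is only that the decision is a fixed, very shallow computation over a short vector of precomputed quantities.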
3.1.3.2 Trigger R&D at CERN. Some of the research and development projects at CERN are investigating neural networks for triggering applications for the LHC accelerator. In one project, a type of detector called a "transition radiation detector" (TRD) was designed to tell electrons from pions in an online trigger (Hansen 1992). The TRD will have 192 input wires, embedded in a special substrate, which sense the passage of the electron. The analog values from these wires will be fed into an MLP with 96 hidden units and one output unit which signals whether or not an electron was present. In a simulation, the TRD rejected 92% of pions and accepted 90% of electrons. These results were better than the 89% rejection, 90% acceptance obtained with a more traditional analysis. Ultimately the neural network will be implemented in silicon with fixed weights. A prototype chip has already been built which has 32 input units and 32 hidden units. The propagation time through the chip is 15 nsec; thus, the processing is sufficiently fast for incorporation into a first level trigger for LHC or SSC. A group at the Dutch lab NIKHEF is investigating a calorimetry-based neural network trigger for the LHC accelerator as part of a research collaboration at CERN (Vermeulen 1992). The approach is similar to the CDF trigger in that it will perform simple pattern matching upon energy patterns in local regions of the calorimeter. This is a 2-year pilot project which will compare the neural net solution to other techniques. The exact hardware implementation is still under development but will probably use a fast digital signal processor to implement the neural network algorithm.

3.2 Other Low Level Pattern Recognition Applications.
3.2.1 Track Segment and Vertex Finding. This discussion is from Lindsey and Denby (1991), in which data from a proton-antiproton collider experiment were fed to an MLP trained to find the primary vertex of the event, based upon drift times in the z-chamber, a drift chamber with three layers of wires placed near the beam pipe. The primary vertex is the point from which the tracks in the event emanate, and marks the location of the collision. Figure 7 shows the hits in the chamber for a typical event; here, only the hit wires are shown, not the drift times. The hits appear to emerge from a point on or near the beam line. The vertex position in collider experiments is normally not available online. It would, however, be very useful, since it could be used to improve trigger calculations which assume a nominal vertex position at the center of the apparatus, and to flag or reject events that contain multiple interactions (i.e., more than one primary vertex). Vertex calculations
Bruce Denby
530
are normally not performed until the offline analysis. A cross check of the offline analysis is provided by the time-of-flight (TOF) system, which crudely measures the vertex position using timing information.

Figure 7: A typical proton/antiproton collision viewed in the z-chamber of the E735 experiment. Only the hit sense wires along the beam line (from -50 cm to +50 cm) are shown, not the drift times; the track-fit vertex, the TOF vertex, and the neural net output are also indicated.

The 288 sense wires of the chamber were broken up into 18-wire subsections (3 layers of 6 wires each) for processing by the network. The sets of 18 drift times became inputs to identical MLPs, each with a single hidden layer of 128 units. Each output layer had 62 units: 60 representing 1.0-cm bins from -30 to +30 cm and 2 "overflow" units. The 18-wire subnetworks were trained to represent the vertex position by a gaussian histogram in the output units, which gives good vertex position resolution with relatively few output units. Training was done using real data recorded in a previous run of the E735 experiment at Fermilab (E735 1991). Targets were obtained using the Z position of the vertex calculated using the standard offline algorithm. The 18-wire subsections were chosen so as to overlap in order not to miss tracks which may span subsections. The outputs of the subnets are then simply added, as illustrated in Figure 8. Figure 9 compares the distribution of Z(offline) - Z(net) to that of Z(offline) - Z(TOF), where Z is the position along the direction of the beam particles. The neural network Z resolution is about 3 times better than that of TOF, and its performance can probably be improved even further by using additional wire layers in the chamber. TOF is currently analyzed offline. It might be possible to implement it online, but its resolution can probably not be improved because it is a technology which has already been pushed to its limits. Also, the TOF technique cannot handle cases of multiple vertices. The neural net treats these in a natural way: each vertex appears as a bump in the summed net output.
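The combination step can be sketched, under simplified assumptions, as follows: each 18-wire subnetwork emits a 60-bin histogram over -30 to +30 cm, the histograms are summed, and the peak of the sum is taken as the primary vertex. The gaussian stand-in for a trained subnet below is purely illustrative.

```python
import math

N_BINS, Z_LO, Z_HI = 60, -30.0, 30.0  # 1-cm bins along the beam line

def bin_center(i):
    return Z_LO + (i + 0.5) * (Z_HI - Z_LO) / N_BINS

def vertex_estimate(subnet_outputs):
    """Sum the position histograms of all subnets and return the z of
    the highest bin, the most probable primary vertex position."""
    summed = [sum(out[i] for out in subnet_outputs) for i in range(N_BINS)]
    peak = max(range(N_BINS), key=summed.__getitem__)
    return bin_center(peak)

def fake_subnet(z, sigma=2.0):
    """A stand-in for one trained subnet: a gaussian bump at vertex z."""
    return [math.exp(-0.5 * ((bin_center(i) - z) / sigma) ** 2)
            for i in range(N_BINS)]

# 18 subnets all seeing a collision at z = 7 cm:
outputs = [fake_subnet(7.0) for _ in range(18)]
assert abs(vertex_estimate(outputs) - 7.0) < 1.0
```

A second interaction in the same crossing would simply add a second bump to the summed histogram, which is why the method handles multiple vertices naturally where TOF cannot.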
Figure 8: The z-chamber sits near the beam pipe to detect outgoing charged particles whose trajectories can be used to determine the vertex position. It is divided into 18 wire subsections, each with its own MLP, whose outputs are summed to give a distribution whose peak indicates the most probable vertex position.
3.2.2 Kink Recognition. A high-energy pion or kaon will sometimes decay in a tracking chamber volume into a muon and a neutrino. The neutrino is neutral and is not seen in the tracking chamber. The muon is charged and is seen, but has a different momentum from the original particle. The result is a track that appears to have a "kink" in it (Fig. 10). In this work (Stimpfl-Abele and Garrido 1991; Stimpfl 1992), simulated pion tracks of 3, 5, and 10 GeV momentum were generated and transported through a chamber modeled on that of the Aleph experiment at CERN. A detailed detector simulation was used to model noise hits and other instrumental effects. Two approaches were tried. In the first, helical track segments are fit to the hit positions in an inner region, 1, and an outer region, 2 (Fig. 10). The 5 helix parameters8 in the two regions are then used as input to an MLP, which tells whether or not this track is due to a decay. In the second approach, a single fit is done to the track across both regions, and the residuals of the fit are used as input to the neural network. There will be 42 residuals, one for each

8 The helix parameters are the Z position of the vertex, the polar and azimuthal angles of the axis of the helix, the radius, and the pitch.
measurement along the trajectory. As a variant to this second approach, groups of three residuals were averaged to give 14 residuals as input to the network. The results are summarized in Table 7, which also shows the network architectures tried. Also given in the table is the result obtained with the standard method for kink identification, called the analytical χ² method, in which again the track is fit in two regions and a χ² is calculated from the helix parameters in the two regions to determine the probability of the no-kink hypothesis. Both of the neural net methods are found to have higher efficiency than the standard chi-squared method. The neural network residual method is about 20 times faster to calculate than the analytical χ² method, assuming that the residuals are already available from the standard track fit.

3.2.3 Other Applications. A variety of other applications of neural networks to low level pattern recognition in high-energy physics have appeared, which we mention only briefly; the interested reader may consult the references. In an application to a Cherenkov9 detector, MLPs were used to find a set of dots forming a ring pattern in a noisy image (Altherr et al. 1992; deGroot and Los 1991). In another hardware application (Haggerty 1992), a discrete component hardware MLP was used to measure, in real time, the position of a muon track in a tracking chamber using charges induced on electrodes placed below the sense wire. MLPs have been used to perform electron/pion discrimination in a calorimeter (Garlatti Costa et al. 1992; Teykal 1992) and identification of heavy quarks using the presence of multiple vertices in a vertex tracking chamber (Gupta et al. 1991; Denby 1992). Applications to charged track reconstruction will be discussed in Section 5.

4 Physics Process Determination

4.1 B Tagging. Numerous groups have used neural networks for identifying reactions containing b quarks. This is usually referred to as "b tagging." Typically this has been done at the four experiments at the LEP electron-positron collider (Proriol et al. 1991; Proriol 1992; Bortolotto et al. 1991; deGroot and Los 1991; Gottschalk and Nolty 1991; Bellantoni et al. 1991; Seidel et al. 1992; Branchini et al. 1992; Brandl 1992), although some work with simulated jets at proton/antiproton colliders has also been reported (Denby et al. 1990). B tagging is of considerable interest since the properties of many particles containing b quarks have to date not been well studied. In the LEP work, the approach is typically to choose an ensemble of feature variables which describe the spatial distribution of energy within each jet and of the event as a whole. Additional

9 A Cherenkov detector measures the mass of certain types of particles using the light the particle produces in passing through a transparent medium.
Figure 9: (a) The difference between Z(vertex) as measured by the neural net and by the standard offline program, in centimeters. (b) The difference between Z(vertex) as measured by the TOF counters and the standard offline program. The neural net resolution is much better.

Table 7: Efficiencies (in percent) for correctly identifying kinks (defined in text) in pion tracks of 3, 5, and 10 GeV momentum.a

Method            3 GeV    5 GeV    10 GeV
5-5-1 (par)       78.9     67.0     53.5
5-10-1 (par)      79.1     67.2     53.6
14-7-1 (res)      79.9     65.5     51.5
14-14-1 (res)     80.5     65.7     53.9
42-6-1 (res)      80.3     67.7     54.5
Analytical χ²     76.0     62.0     40.2

a Two MLP architectures were tried for the case of track parameters as net inputs, and three for the case of fit residuals as net inputs. The result for the standard method, analytical χ², is also given.
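The analytical χ² baseline in Table 7 can be sketched as follows; a diagonal covariance and a 5-degree-of-freedom cut value are illustrative simplifications of the real fit, which would use the full parameter covariance matrices.

```python
def kink_chi2(params_inner, params_outer, sigmas):
    """Chi-squared distance between the five helix parameters fitted
    in the inner and outer track regions, assuming independent errors
    (an illustrative simplification)."""
    return sum(((a - b) / s) ** 2
               for a, b, s in zip(params_inner, params_outer, sigmas))

def is_kink(params_inner, params_outer, sigmas, cut=11.07):
    """Flag a decay when the two fits disagree; 11.07 is the 95% CL
    chi-squared cut for 5 degrees of freedom (illustrative choice)."""
    return kink_chi2(params_inner, params_outer, sigmas) > cut

# Identical fits give no kink; a 6-sigma shift in one parameter does.
inner = [0.0, 1.2, 0.3, 50.0, 5.0]    # z0, polar, azimuth, radius, pitch
sigmas = [0.1, 0.01, 0.01, 0.5, 0.05]
assert not is_kink(inner, inner, sigmas)
outer = list(inner)
outer[1] += 6 * sigmas[1]
assert is_kink(inner, outer, sigmas)
```

The neural net variants replace this fixed quadratic form with a learned decision surface over the same ten parameters, or over the fit residuals directly, which is where the efficiency gain in Table 7 comes from.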
Figure 10: A pion decays to a muon and a neutrino, producing what appears as a track with a "kink." The kink is recognized by comparing the fitted track parameters in region 1 and region 2.

information, such as that from vertex tracking chambers, may also be included. We choose as an example of this type of study the analysis performed by members of the Delphi experiment, which extends the analysis to charm quarks and undifferentiated light quarks in order to extract the decay probabilities of the Z boson into these quarks. This analysis is described in the next section.

4.2 Decay Probabilities of the Z. The neutral boson Z can decay into any constituent plus its anticonstituent (e.g., electron plus positron, u quark plus u antiquark). The standard model dictates the types of interactions that the constituents can undergo, but the relative strengths of the various interactions must be verified experimentally. A group from the DELPHI collaboration (one of the 4 major experiments at the LEP accelerator at CERN) has recently used a feedforward neural network to classify decays of the Z into three classes: c quark-antiquark pairs, b quark-antiquark pairs, or light quark (u, d, or s) quark-antiquark pairs (Cosmo et al. 1992; De Angelis 1992; Eerola 1992; see also Bortolotto et al. 1991). This classification has permitted a measurement of the probabilities of the Z to decay into these particles to be made with higher precision than was previously possible. The probability of the Z to decay into the leptons electron, muon, and tau has been well established. That measurement is "easy" to make since these particles are relatively easy to identify in the apparatus. The case of the decay of the Z into quarks is considerably more difficult since the final state quarks fragment immediately into jets. The problem then becomes
Neural Networks in High-Energy Physics
535
deducing the type of quark involved in the decay from the properties of the jets themselves and from their distribution within the apparatus. The standard technique for distinguishing heavy quarks from light quarks is through their so-called semileptonic decays, in which a particle containing a heavy quark decays to a lepton plus other particles. This technique has two disadvantages: (1) semileptonic decays account for only 20% of heavy quark decays, therefore with this technique it will be more difficult to obtain a sample large enough to assure small statistical errors; (2) in a semileptonic decay a neutrino is also emitted; these escape detection, making it impossible to completely reconstruct the event, leading to uncertainty in quark species in some cases. A technique that allows the use of all types of heavy quark decays is thus desirable. In the DELPHI work, 19 jet and event-shape variables were created as inputs to an MLP. The variables describe the spatial distribution of energy in the jets and in the event as a whole, various kinematical combinations of the momenta of the particles in the jets, as well as information about the presence of leptons in the event. An exact description of the 19 variables is not very illuminating to the nonspecialist; the interested reader is referred to the original works. The network architecture chosen had 25 hidden units and 3 output units to encode the three classes. The training data for the network were generated with a standard physics Monte Carlo program and a program that simulates the response of the DELPHI apparatus to particle collisions. A total of 6000 training events were used. An independent set of 200,000 events was used for testing the network. The trained network was then used to determine the relative fractions of b, c, and light quark decays in a sample of 123,475 real events from the DELPHI experiment. To do this, a two-dimensional representation of the network output was devised as follows. 
The values of the 3 output nodes were normalized to sum to 1. Each event can then be represented as a point within an equilateral triangle where the perpendicular distances of the point to the sides of the triangle represent the values of the output nodes. This type of representation is referred to as a Dalitz plot. Figure 11 shows the distribution within this plane of Monte Carlo events for b, c, and light quark decays, as well as for the real data. The fractions were obtained by fitting the real data distribution to a linear combination of the Monte Carlo distributions for the three classes:
R(u, v) = (1 - F_c - F_b) U_1(u, v) + F_c U_2(u, v) + F_b U_3(u, v)

where u, v are the variables defining the plane, R is the distribution of the real data, F_c and F_b are the fractions of decays containing c and b quarks, respectively, and U_1, U_2, and U_3 are the distributions of the Monte Carlo data for the three classes. The results of the fit are
F_c = 0.151 ± 0.008 (stat.) ± 0.041 (syst.),  F_b = 0.232 ± 0.005 (stat.) ± 0.017 (syst.)
Bruce Denby
Figure 11: Dalitz plots used to measure the relative fractions of b, c, and light (uds) quarks in the decays of the Z⁰. The activation of the network output node corresponding to each class is represented as the perpendicular distance from the side of the triangle opposite the corner labeled with that class. The outputs of the three nodes always sum to 1. (a,b,c) The distribution of network outputs for Monte Carlo (uds), c, and b quarks, respectively. (d) The distribution for real data from DELPHI. To extract the fractions of b, c, and (uds), the distribution in (d) is fit as a linear combination of the distributions of (a), (b), and (c), where the coefficients in the linear combination are the desired fractions.

where the first error is due to statistics, and the second to an incomplete knowledge of certain parameters in the Monte Carlos and to the dependence of the result on which Monte Carlo model is used. For comparison, the best result to date for F_b (Abreu et al. 1992) using semileptonic decays is F_b = 0.215 ± 0.017 (stat. + syst.), where the systematic error contains effects due to parameter and model
dependence. For the charm quarks, the best result to date (Abreu et al. 1990) is obtained by identifying a characteristic low energy pion from the decay of a particle containing a charm quark. The result is F_c = 0.162 ± 0.030 (stat.) ± 0.050 (syst.).
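The triangle representation described above is easy to reproduce. The sketch below is an illustration, not the DELPHI code: it maps a normalized output triple to a point whose perpendicular distances to the sides of a unit-height equilateral triangle equal the three outputs (Viviani's theorem guarantees the distances sum to the height).

```python
import math

def dalitz_point(outputs):
    """Map three non-negative network outputs to a point (u, v) inside an
    equilateral triangle of unit height.  After normalization the outputs
    sum to 1 and equal the perpendicular distances from (u, v) to the
    three sides (Viviani's theorem)."""
    s = sum(outputs)
    p = [x / s for x in outputs]
    # Vertices chosen so the triangle has height 1 (side length 2/sqrt(3));
    # vertex i is the corner opposite the side whose distance encodes p[i].
    verts = [(0.0, 0.0),
             (2.0 / math.sqrt(3.0), 0.0),
             (1.0 / math.sqrt(3.0), 1.0)]
    u = sum(pi * vx for pi, (vx, _) in zip(p, verts))
    v = sum(pi * vy for pi, (_, vy) in zip(p, verts))
    return u, v
```

For example, an event classified purely as the third class maps to the apex of the triangle, and an ambiguous (1/3, 1/3, 1/3) event maps to the centroid.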
In the case of F_c, both statistical and systematic errors are better for the neural net result than for the semileptonic decay result. In the case of F_b, although the neural network result has a very slightly larger overall error, it is obtained with significantly lower statistics. This is because the neural net analysis allows all the data to be used, while the standard analysis can use only the rarer semileptonic decays. The neural network approach also has the advantage of providing a heavy quark probability on an event-by-event basis, whereas the semileptonic decay technique relies on global distributions for the entire data set.

4.3 Quark/Gluon Separation. The ability to distinguish quark jets from gluon jets is clearly very desirable. The W and Z decay 80% of the time to two quarks, but normally these decays are unusable since it is not possible to distinguish these jets from the more copiously produced gluon jets. Furthermore, the most probable decay mode of the much sought top quark is into three quark jets, but this channel has long been considered unusable due to high backgrounds from multigluon final states. The ability to verify three quark jets would dramatically reduce the background. Distinguishing quark jets from gluon jets has been thought by many high-energy physicists to be impossible due to the high degree of similarity between the two types of jets. Separation of quark and gluon jets using neural networks has been treated in a number of references (Lonnblad et al. 1990, 1991a; Bhat et al. 1990; Csabai et al. 1991; Baer et al. 1991; Barbagli et al. 1992). These results have been almost exclusively based upon data generated by Monte Carlo. Recently a new result from the Fermilab Tevatron collider has appeared (Bianchin et al. 1992a,b), which for the first time appears to give evidence of quark and gluon components in real jets produced in proton/antiproton collisions.
In the Fermilab result, jets identified in the apparatus are represented by a set of 8 feature variables which describe the spatial distribution of energy within the jets, for example, the amount of energy contained within each of three concentric cones centered on the centroid of the jet, the rms width of the jet, etc. A backpropagation neural network with these 8 variables as inputs was trained to separate quark jets from gluon jets based on examples generated by Monte Carlo. It is necessary to use Monte Carlo since pure samples of quarks and gluons do not exist. The real data will always contain a mixture of quark and gluon jets, and in fact the relative ratio of quarks and gluons in various kinematical regions is one of the sought after results. For this reason, this problem too will suffer from the fact that the results will depend on which model of quark and gluon fragmentation has been used.
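As a rough illustration of the kind of jet-shape inputs described above, the following sketch computes the fraction of jet energy inside concentric cones around the jet's energy centroid and the energy-weighted rms width. The particle representation, the cone radii, and the neglect of phi wraparound are simplifying assumptions for illustration, not the actual CDF definitions.

```python
import math

def jet_shape_features(particles, cone_radii=(0.2, 0.4, 0.6)):
    """Compute simple jet-shape variables from a list of (energy, eta, phi)
    particles: energy fractions inside concentric cones around the
    energy-weighted centroid, and the energy-weighted rms angular width.
    Cone radii are illustrative; phi wraparound is ignored."""
    e_tot = sum(e for e, _, _ in particles)
    eta_c = sum(e * eta for e, eta, _ in particles) / e_tot
    phi_c = sum(e * phi for e, _, phi in particles) / e_tot

    def dr(eta, phi):
        # angular distance from the jet centroid
        return math.hypot(eta - eta_c, phi - phi_c)

    fractions = [sum(e for e, eta, phi in particles if dr(eta, phi) < r) / e_tot
                 for r in cone_radii]
    rms = math.sqrt(sum(e * dr(eta, phi) ** 2 for e, eta, phi in particles) / e_tot)
    return fractions, rms
```

A narrow (quark-like) jet yields cone fractions near 1 already at small radii and a small rms width; a broad (gluon-like) jet spreads its energy over larger radii.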
Figure 12: Output of 8-6-1 MLP for (a) Monte Carlo quarks (the Pythia Monte Carlo was used in these studies), (b) Monte Carlo gluons, and (c) real data from the CDF experiment (labeled "Jet 40"). All the jets are required to have E_T greater than 60 GeV. The real data appear to be predominantly gluon-like with a small admixture of quarks, as expected from theory.
There is considerable overlap of the two classes in all of the feature variables, and none is adequate to provide a useful classification of the jets. Figure 12 a and b shows the output of the trained neural network on independent test samples of Monte Carlo quarks and gluons. The quark and gluon distributions overlap substantially: quark and gluon jets are indeed very similar! However the separation achieved is useful because quark or gluon enriched samples can now be produced by placing cuts on the output of the neural network.
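The enrichment obtained by cutting on the network output can be quantified as follows; this helper and its equal-size-sample assumption are hypothetical illustrations, not part of the published analysis.

```python
def enrichment(quark_outputs, gluon_outputs, cut):
    """Efficiency and purity of a quark-enriched sample obtained by keeping
    jets whose network output exceeds `cut`.  Assumes (for illustration)
    equally sized quark and gluon samples."""
    q_pass = sum(1 for o in quark_outputs if o > cut)
    g_pass = sum(1 for o in gluon_outputs if o > cut)
    eff = q_pass / len(quark_outputs)
    purity = q_pass / (q_pass + g_pass) if (q_pass + g_pass) else 0.0
    return eff, purity
```

Raising the cut trades quark efficiency for purity, which is exactly the trade-off exploited when producing enriched samples.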
A study was made of the efficiency of the network, defined as the fraction of quark jets with network output above 0.5, as a function of the number of nodes in the hidden layer. Performance did not improve beyond the results with two hidden units, and in fact a simple perceptron (no hidden units) was only a few percentage points worse by this measure. However, the network output distributions for the zero and two hidden unit cases were much more gaussian in shape and did not include any regions in which the quark-to-gluon ratio was high. Such regions may prove valuable for placing cuts that enrich the quark-to-gluon ratio at the price of reduced quark efficiency. For this reason, the results from the 6 hidden unit network were retained for the final analysis. The maximum efficiency achieved on the Monte Carlo data was 70%. Figure 12c shows the result of applying the trained net to a sample of real data from the CDF experiment. The real data distribution appears to be predominantly gluon-like with a non-zero admixture of quarks, which is consistent with the result expected on theoretical grounds for events in the kinematical regime in which the data were taken. A fit to the real data as a linear combination of the Monte Carlo quark and gluon distributions gives a good χ², but because of model dependence and some subtleties in the Monte Carlo programs, it has not yet been possible to extract the exact quark fraction from this distribution in an unambiguous way. However, the results are encouraging and work is continuing. More recently, another analysis was performed (Bianchin et al. 1992a) in which a feature map was trained on a sample of mixed Monte Carlo quarks and gluons and then used to identify quarks and gluons in an independent sample. A somewhat higher efficiency, about 72%, was obtained.
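The linear-combination fit mentioned above has a simple closed form in the two-template case. The sketch below is an illustration, not the CDF fit (which must also treat statistical errors and model dependence): it estimates the quark fraction f in the model real[i] ≈ f·quark[i] + (1 - f)·gluon[i] by unweighted least squares.

```python
def quark_fraction(real, quark_mc, gluon_mc):
    """Least-squares estimate of the quark fraction f in a binned network
    output distribution, modeling real[i] ~ f*quark_mc[i] + (1-f)*gluon_mc[i].
    All three histograms should be normalized to the same total.
    Writing real - gluon = f*(quark - gluon) and projecting gives f."""
    num = sum((r - g) * (q - g) for r, q, g in zip(real, quark_mc, gluon_mc))
    den = sum((q - g) ** 2 for q, g in zip(quark_mc, gluon_mc))
    return num / den
```

The same projection idea extends to the three-template heavy-flavor fit of Section 4.2, where two fractions are solved from 2x2 normal equations.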
The feature map trained on Monte Carlo is also being applied to the real data, and, conversely, a feature map trained on real data is being applied to labeled Monte Carlo data. Training using only real data is very attractive since it avoids the problem of model dependence, although it may be necessary to use the Monte Carlo data to label the nodes in the topological map. These analyses are still in progress.

4.4 Additional Physics Process Applications. The use of learning vector quantization (LVQ) and topological maps is relatively new in HEP. An interesting application of topological maps appears in Lonnblad et al. (1991b), in which a map is used to discover the b, c, and light quark classes in a sample of mixed Monte Carlo data. A similar application is being attempted for data at the Tevatron (Bianchin et al. 1992b). LVQ has been used for b tagging (Proriol et al. 1991; Proriol 1992) and discrimination of tt̄ events from background (Odorico 1991). Other MLP offline applications include resonance searches¹⁰ (Alexopoulos 1991), calculation
¹⁰A resonance is a bound state of two or more particles and appears as a peak in a mass distribution.
Figure 13: Neuron links in the Denby-Peterson net.

of the total mass of the particles in an event (Lonnblad et al. 1991b), determination of the charge of the initial quark which produced a jet (Varela and Silva 1991), and identification of jet cascades with muons (Los 1992).

5 Neural Nets and Charged Track Reconstruction
5.1 Tracking with Recurrent Nets. Recurrent networks have been used in HEP for track reconstruction, using an algorithm developed by Denby and independently by Peterson (Denby 1988; Peterson 1989; Stimpfl-Abele and Garrido 1990; Denby and Linn 1990; Barbagli 1992). In this application a neuron is defined to be a directed link between two hits in a tracking detector. The approach resembles qualitatively the encoding used by Hopfield (Hopfield and Tank 1986) for solving the Traveling Salesman Problem with a recurrent net. The weight connecting two neurons i and j is determined by the angle θ_ij between them (Fig. 13):

w_ij = cos^m θ_ij / (l_i + l_j)

if i and j do not both point into or out of the same point, where l_i and l_j are the lengths of the neurons (i.e., the distance between hits), and w_ij = -B if i and j are head-to-head or tail-to-tail. An energy function is defined, E = -1/2 Σ_ij w_ij o_i o_j, where o_i is the output of neuron i. The energy function will be smallest when the angles between close-together neurons sharing points are small. This favors neurons lying along smooth trajectories such as those of particles moving in a magnetic field. The constraint term -B ensures a unique direction to the tracks, to avoid a degeneracy that prevents settling of the network. The evolution of the system is
Figure 14: Charged track reconstruction on real data in the ALEPH central tracking chamber, using a recurrent neural network algorithm. In the top figures the beam pipe is perpendicular to the plane of the page; in the bottom figures, horizontal. The left-hand frames show the neuron links before evolution; at right are the found tracks at the end of evolution.
obtained by iteratively solving the update equations:

τ du_i/dt = Σ_j w_ij o_j - u_i,    o_i = sigmoid(u_i)

On each iteration, dt is kept much smaller than τ, the time constant of the system. This method has been used on real data at the ALEPH experiment at LEP (Stimpfl-Abele and Garrido 1990). Figure 14 shows r-phi (i.e., looking down the beam line) and r-z (side) views of an event in which a Z boson decays to hadrons, with all links defined before network evolution (left side of figure), and the event after settling of the network, with tracks found (right side). The efficiency is as good as the conventional
track reconstruction program, but the neural net algorithm is somewhat faster. In this work, a study was made of execution time for the neural net and conventional algorithms as a function of track multiplicity (number of charged tracks in the event). The advantage of the neural algorithm is shown to increase with multiplicity. Although this type of algorithm has not yet been accepted as a standard track recognition algorithm, it may prove to be important in the future when track multiplicities will be larger. There is not a straightforward way to implement this algorithm in the fast hardware that would be needed to make it applicable at the trigger level, since the number of neurons and weights is high, and the weights must be recalculated for each event. In addition, the algorithm does not take advantage of all the available information, such as the fact that tracks in a uniform magnetic field are known to be nearly perfect helices. This makes the algorithm more susceptible to noise, since it will be less able to reject noise hits that happen to lie near the tracks.

5.2 Elastic Tracking. Improvements to neural tracking are the so-called elastic tracking (Gyulassy and Harlander 1991) and deformable template (Ohlsson et al. 1991) approaches. In these approaches, a track is a helical object that settles into a shape which best fits the hits. The helix can be thought of as electrically charged and attracted to the hits, which have opposite charge. Although these algorithms map the tracking problem onto dynamical systems, and are at least in principle parallelizable, they have lost some of the "neural" flavor of the original Denby-Peterson net. Nonetheless, the efficiency and robustness to noise of the elastic methods are excellent. One interesting study (Gyulassy and Harlander 1991) compared the robustness to noise of the standard method, the Denby-Peterson net, and the elastic tracking method.
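For concreteness, the Denby-Peterson relaxation of Section 5.1 can be sketched on a toy hit set as follows. The weight exponent m, the penalty B, and the integration parameters here are illustrative choices, not those of the cited implementations.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def denby_peterson(hits, m=5, B=5.0, tau=1.0, dt=0.05, steps=300):
    """Toy sketch of the Denby-Peterson track finder.  Each neuron is a
    directed link between two hits.  A pair of links joined head-to-tail
    at a shared hit is rewarded by cos^m(theta)/(l_i + l_j); links meeting
    head-to-head or tail-to-tail at a hit are penalized by -B.  The state
    then relaxes via tau du_i/dt = sum_j w_ij o_j - u_i, o_i = sigmoid(u_i)."""
    n = len(hits)
    links = [(a, b) for a in range(n) for b in range(n) if a != b]

    def direction(link):
        (x1, y1), (x2, y2) = hits[link[0]], hits[link[1]]
        return x2 - x1, y2 - y1

    def length(link):
        dx, dy = direction(link)
        return math.hypot(dx, dy)

    w = {}
    for i, li in enumerate(links):
        for j, lj in enumerate(links):
            if i == j:
                continue
            if li[1] == lj[0]:                      # head of i meets tail of j
                (ax, ay), (bx, by) = direction(li), direction(lj)
                c = (ax * bx + ay * by) / (length(li) * length(lj))
                if c > 0:
                    w[(i, j)] = c ** m / (length(li) + length(lj))
            elif li[0] == lj[0] or li[1] == lj[1]:  # tail-to-tail / head-to-head
                w[(i, j)] = -B

    random.seed(0)
    u = [random.uniform(-0.1, 0.1) for _ in links]
    o = [sigmoid(x) for x in u]
    for _ in range(steps):
        net = [0.0] * len(links)
        for (i, j), wij in w.items():
            net[i] += wij * o[j]
        # Euler step with dt << tau, as in the text
        u = [ui + (dt / tau) * (ni - ui) for ui, ni in zip(u, net)]
        o = [sigmoid(x) for x in u]
    return links, o
```

After settling, links along smooth chains end up with outputs near 1 and conflicting links are suppressed; thresholding the outputs yields the found track segments.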
The standard method of track reconstruction is called the “roadfinder” since it starts with two nearby hits and then searches for additional hits on a ”road” in the direction of the segment joining them. Figure 15 from this study shows the efficacy of each method as a function of number of tracks. All data have 20% noise and 3% error on position measurement. The roadfinder breaks down between 5 and 10 tracks, the Denby-Peterson net at 10-15 tracks, but the elastic tracking always finds the correct answer in this study.
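The elastic idea can be illustrated with a minimal deformable template in two dimensions: a circle (the transverse projection of a helix) whose center and radius relax by gradient descent on the summed squared radial residuals, as if the hits attracted the template. All parameter values are illustrative, and the cited methods handle full helices and multiple templates.

```python
import math

def fit_circle_template(hits, cx, cy, r, rate=0.05, steps=2000):
    """Minimal 'deformable template' sketch: a circular track template with
    center (cx, cy) and radius r relaxes by gradient descent on
    sum over hits of (distance-to-center - r)^2."""
    n = len(hits)
    for _ in range(steps):
        gx = gy = gr = 0.0
        for x, y in hits:
            dx, dy = x - cx, y - cy
            d = math.hypot(dx, dy)
            res = d - r            # signed radial residual of this hit
            gx += -res * dx / d    # d(res^2)/dcx up to a factor of 2
            gy += -res * dy / d
            gr += -res
        cx -= rate * gx / n
        cy -= rate * gy / n
        r -= rate * gr / n
    return cx, cy, r
```

Starting from a rough seed, the template settles onto the circle through the hits; noise hits far from the template pull on it only weakly through their residuals, which is the robustness the comparison above measures.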
6 Conclusion
Five years ago there was no explicit mention of neural network techniques in HEP literature. A current bibliography of applications in HEP includes almost a hundred papers and reports. Much of the work is still exploratory and uses only the simplest techniques such as the MLP
Figure 15: Comparison of track reconstruction performance for the standard method (roadfinder), Denby-Peterson net, and elastic tracking.
trained with backpropagation, although some interesting results using learning vector quantization and feature maps have also appeared. In HEP, historically, data analysis has been done using simple one-dimensional cuts; consequently the HEP community at large has yet to fully accept neural network techniques as standard tools. Nevertheless the neural network methods are beginning to show their worth. The decay probabilities of the Z boson into b, c, and light quarks have been measured with higher precision than ever before using a technique based on an MLP. A neural network technique has given higher kink finding efficiency and faster execution speed than the standard method. Results consistent with identification of quark and gluon components in jets produced at a proton/antiproton collider have appeared for the first time using a feedforward neural network. Recurrent networks have provided a faster way of performing charged track reconstruction. One of the most exciting promises of neural network technology is in the realm of triggering for HEP. One test has already been completed: a VLSI neural network used in the data acquisition system of a drift chamber has provided, in only a few microseconds, track intercept resolution 50 times more accurate than that previously obtainable online. Neural network triggers for three large collider experiments are either currently installed or have been proposed for future installation. The neural network triggers will permit experiments to reject more background events earlier in the data stream, resulting in more efficient and cost-effective data acquisition systems and reduced data storage requirements. It is intellectually quite stimulating to witness a marriage between such seemingly disparate domains as high-energy physics and neural networks. Given the growth of applications and their success to date, HEP may turn out to be one of the driving forces in the integration of neural networks into science as data analysis tools.
References

Abreu, P., et al. 1990. Phys. Lett. B 252, 140-148.
Abreu, P., et al. 1992. Measurement of the partial width of the Z⁰ into bb final states using their semileptonic decays. CERN-PPE/92-89. Zeit. Phys. C 56, 47-62.
AIHEP 1992. Proceedings of the Second International Workshop on Software Engineering, Artificial Intelligence, and Expert Systems for Nuclear and High Energy Physics, La Londe les Maures, France, January, World Scientific.
Alexopoulos, T. 1991. Resonance searches using a neural network technique. Talk at DPF 91, Vancouver, Canada, August 1991, submitted to proceedings.
Alexopoulos, T. 1991. Ph.D. Thesis, University of Wisconsin, unpublished.
Altherr, T. 1992. Cerenkov ring recognition using adaptable and non-adaptable networks. New Computing Techniques in Physics Research, II, D. Perret-Gallix, ed., World Scientific.
Amendolia, S. R., et al. 1990. Study of a fast trigger system on beauty events at fixed target and colliders. Nucl. Instrum. Methods A289, 539.
Badgett, W., Burkett, K., Campbell, M., Wu, D. Y., Denby, B., Lindsey, C., Blair, R., Kuhlmann, S., and Romano, J. 1992. A neural network calorimeter trigger used in CDF. Proceedings of the 1992 IEEE Nuclear Science Symposium, Orlando, Florida.
Baer, H., Karatas, D., and Giudice, G. 1991. Snagging the top quark with a neural network. FSU HEP 911130, Florida State University, Tallahassee, November.
Barbagli, G., D'Agostini, G., and Monaldi, D. 1992. Quark/gluon separation in
the photoproduction region with a neural network algorithm. Università di Roma 'La Sapienza', Internal note N.992, February.
Bellantoni, L., Conway, J. S., Jacobsen, J. E., Pan, Y. B., and Wu, Sau Lan. 1991. Using neural networks with jet shapes to identify b jets in e+e- interactions. CERN-PPE/91-80, 24 May. Nucl. Instrum. Methods A310, 618-622.
Bhat, P., Lonnblad, L., Meier, K., and Sugano, K. 1990. Using neural networks to identify jets in hadron-hadron collisions. Proceedings of the 1990 Summer Study on High Energy Physics: Research Directions for the Decade, Snowmass, Colorado, June 25-July 13.
Bianchin, S., Denardi, M., Denby, B., Dickson, M., Pauletta, G., Santi, L., and Wainer, N. 1992a. Classification of jets from PPbar collisions at Tevatron energies. New Computing Techniques in Physics Research, II, D. Perret-Gallix, ed., World Scientific.
Bianchin, S., Dall'Agata, M., De Nardi, M., Pauletta, G., Santi, L., Denby, B., Wainer, N., and Dickson, M. 1992b. Jet classification at CDF. Proceedings of the Second Workshop, Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, 18-26 June. Int. J. Neural Syst., in press.
Bock, R. K., et al. 1990. Feature extraction in future detectors. Nucl. Instrum. Methods A289, 534.
Bortolotto, C., Cosmo, G., DeAngelis, A., Linussio, A., Eerola, P., and Kalkkinen, J. 1991. A measurement of the partial hadronic widths of the Z⁰ using neural networks. Proceedings of the Workshop Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, June 5-14, ETS Editrice, Pisa.
Branchini, P., Ciuchini, M., and Del Giudice, P. 1992. B tagging with neural networks: An alternative use of single particle information for discriminating jet events. New Computing Techniques in Physics Research, II, D. Perret-Gallix, ed., World Scientific.
Brandl, B., Falvard, A., Henrard, P., Jousset, J., and Proriol, J. 1992. Tagging of Z decays into heavy quarks in the Aleph detector using multivariate analysis methods: Neural networks, discriminant analysis, clustering. New Computing Techniques in Physics Research, II, D. Perret-Gallix, ed., World Scientific.
CDF 1988. The Collider Detector at Fermilab, a compilation of articles reprinted from Nucl. Instrum. Methods A, North Holland, Amsterdam.
Cosmo, G., De Angelis, A., De Groot, N., Del Giudice, P., Eerola, P., Kalkkinen, J., Lyons, L., Los, M., Torassa, E., and Vallazza, E. 1992. Delphi Collaboration. Classification of the hadronic decays of the Z⁰ into b and c quark pairs using a neural network. XXVI International Conference on High Energy Physics, Dallas, TX, August 5-12, submitted.
Crosetto, D., and Love, L. 1992. Fully pipelined and programmable level 1 trigger. SSCL-576, SSC Laboratory, Dallas, Texas, July.
Csabai, I., Czako, F., and Fodor, Z. 1991. Combined neural network-QCD classifier for quark and gluon jet separation. CERN Preprint CERN-TH.6038/91 and Eotvos University (Budapest) Institute for Theoretical Physics preprint ITP-Rep. Budapest 483, March.
De Angelis, A. 1992. Heavy flavour identification in Delphi. Proceedings of the Second Workshop, Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, 18-26 June, 1992. Int. J. Neural Syst., in press.
De Groot, N., and Los, M. 1991. B-tagging in Delphi with a feed-forward neural network. Proceedings of the Workshop Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, June 5-14, ETS Editrice, Pisa.
Denby, B. 1988. Neural networks and cellular automata in experimental high-energy physics. Computer Phys. Commun. 49, 429-448. Also, Denby, B. 1988. Neural network and cellular automata algorithms. Florida State University preprint FSU-SCRI-88-141, June. Tallahassee, Florida.
Denby, B., Campbell, M., Bedeschi, F., Chriss, N., Bowers, C., and Nesti, F. 1990a. Neural networks for triggering. IEEE Trans. Nucl. Sci. 37(2), 248-254.
Denby, B., Lessner, E., and Lindsey, C. S. 1990b. Proceedings 1990 Conference on Computing in High Energy Physics, Santa Fe, NM, AIP Conf. Proc. 209, 211.
Denby, B., and Linn, S. 1990. Computer Phys. Commun. 56, 293-297.
Denby, B., Franklin, M., Kim, S. H., Konigsberg, J., and Timko, M. 1991. CDF Internal Note 1538. Proposal for a level-2 isolated plug electron trigger for the 1991/1992 run. CDF Collaboration, Fermi National Accelerator Laboratory, Batavia, Illinois.
Denby, B. 1992. Quark flavor sensitivity of the mammalian cortex: Theoretical foundations. Proceedings of the Second Workshop, Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, 18-26 June. Int. J. Neural Syst., in press.
D0 Design Report. 1983. Fermilab, December; D0 Upgrade. 1991. Fermilab P-823, April. Turkot, F., et al. 1991. Nucl. Phys. A525, 165-170.
Eerola, P. 1992. Classification of the hadronic decays of the Z⁰ into b and c quark pairs using a neural network. Proceedings of the Second Workshop, Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, 18-26 June. Int. J. Neural Syst., in press.
Farber, R. M., Kinnison, W., and Lapedes, A. S. 1991. A parallel non-neural trigger tracker for the SSC. LA-UR-91-607, Los Alamos National Laboratory, Los Alamos, NM.
Fortner, M. 1992. Analog neural networks in an upgraded muon trigger for the D0 detector. New Computing Techniques in Physics Research, II, D. Perret-Gallix, ed., World Scientific.
Garlatti Costa, P., De Angelis, A., Lanceri, L., Santi, L., Vignaduzzo, C., and Zoppolato, E. 1992. A neural network for e/p classification in a calorimeter. INFN Sezione di Trieste, Italy, technical note INFN/AE-92/14, 27 April.
Gottschalk, T. D., and Nolty, R. 1991. Identification of physics processes using neural network classifiers. Caltech Report CALT-68-1680.
Gupta, L., Upadhye, A., Denby, B., and Amendolia, S. R. 1992. Neural network trigger algorithms for heavy quark selection in a fixed target high-energy physics experiment. Pattern Recog. 25, 413-421.
Gyulassy, M., and Harlander, M. 1991. Elastic tracking and neural network algorithms for complex pattern recognition. Computer Phys. Commun. 66, 31-46.
Haggerty, H. 1992. Fermilab, private communication.
Hansen, J. R. 1992. The need for neural networks at LHC and SSC. Proceedings of the Second International Workshop on Software Engineering, Artificial Intelligence, and Expert Systems for Nuclear and High Energy Physics, La Londe les Maures, France, January, World Scientific, in press.
Hopfield, J., and Tank, D. W. 1986. Science 233, 625.
Intel. 1991. 80170NX Electrically Trainable Analog Neural Network, Intel Corporation, Santa Clara, California.
Lea, R. M. 1988. ASP: A cost effective parallel microcomputer. IEEE Micro, October.
Lindsey, C. S. 1992. Drift chamber tracking with a VLSI neural network. Proceedings of the Second Workshop, Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, 18-26 June. Int. J. Neural Syst., in press.
Lindsey, C. S. 1991. Tracking and vertex finding in drift chambers with neural networks. Proceedings of the Workshop Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, June 5-14, ETS Editrice, Pisa.
Lindsey, C. S., and Denby, B. 1991. Nucl. Instrum. Methods A302, 217.
Lindsey, C. S., Denby, B., Haggerty, H., and Johns, K. 1992. Nucl. Instrum. Methods A317, 346-356.
Lonnblad, L., Peterson, C., and Rognvaldsson, T. 1990. Finding gluon jets with a neural trigger. Phys. Rev. Lett. 65, 1321-1324.
Lonnblad, L., Peterson, C., and Rognvaldsson, T. 1991a. Using neural networks to identify jets. Nucl. Phys. B349, 675-702.
Lonnblad, L., Peterson, C., and Rognvaldsson, T. 1991b. Mass reconstruction with a neural network. Lund University preprint LU TP 91-25, October. Phys. Lett. B, submitted.
Lonnblad, L., Peterson, C., Pi, H., and Rognvaldsson, T. 1991c. Self organizing networks for extracting jet features. Computer Phys. Commun. 67, 193-209.
Los, M. 1992. Using a neural network for classifying jet cascades with a muon. Proceedings of the Second Workshop, Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, 18-26 June. Int. J. Neural Syst., in press.
Cherubini, A., and Odorico, R. 1991. Identification by neural networks and statistical discrimination of new physics events at high energy colliders. Proceedings of the Workshop Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, June 5-14, ETS Editrice, Pisa.
Ohlsson, M., Peterson, C., and Yuille, A. 1991. Track finding with deformable templates: The elastic arms approach. Lund University Preprint LU TP 91-27, November, Lund, Sweden. Track finding with neural networks. Computer Phys. Commun., submitted.
Peterson, C. 1989. Nucl. Instrum. Methods A279, 537.
Proriol, J., et al. 1991. Tagging B quark events in Aleph with neural networks. Proceedings of the Workshop Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, June 5-14, ETS Editrice, Pisa.
Proriol, J. 1992. Tagging B quark events in e+e- colliders with neural networks. Comparisons of different sets of variables and different methods. Proceedings of the Second Workshop, Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, 18-26 June. Int. J. Neural Syst., in press.
Ramacher, U., Beichter, J., Raab, W., Anlauf, J., Bruls, N., Hachmann, U., and Wesseling, M. 1991. Design of a first generation neurocomputer. In VLSI Design of Neural Networks, U. Ramacher and U. Rückert, eds., pp. 271-310. Kluwer Academic Publishers.
Ribarics, P., et al. 1991. Neural network trigger in the H1 experiment. Proceedings of the Workshop Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, June 5-14, ETS Editrice, Pisa.
Ribarics, P. 1992a. Neural network level 2 trigger in the H1 experiment. New Computing Techniques in Physics Research, II, D. Perret-Gallix, ed., World Scientific.
Ribarics, P. 1992b. Neural network trigger in the H1 experiment. Proceedings of the Second Workshop, Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, 18-26 June. Int. J. Neural Syst., in press.
SDC. 1992. The SDC Tech. Design Rept., SSCL-SR-1215, SSC Laboratory, Dallas, TX, 1 April.
Seidel, F., et al. 1992. Extensive studies on a neural network for b tagging and comparisons with a classical method. New Computing Techniques in Physics Research, II, D. Perret-Gallix, ed., World Scientific.
Stimpfl-Abele, G., and Garrido, L. 1991. Fast track finding with neural nets. Computer Phys. Commun. 64, 46-56.
Stimpfl-Abele, G., and Garrido, L. 1991. Recognition of decays of charged tracks with neural network techniques. Computer Phys. Commun. 67, 183-192.
Stimpfl-Abele, G. 1992. Neural nets for kink finding. Proceedings of the Second Workshop, Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, 18-26 June. Int. J. Neural Syst., in press.
Teykal, H. 1992. Using neural networks for the identification of electrons and pions in a calorimeter for high energy physics. Proceedings of the Second Workshop, Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, 18-26 June. Int. J. Neural Syst., in press.
Silva, P., and Varela, J. 1991. Identification of the quark jet charge using neural networks. Proceedings of the Workshop Neural Networks: From Biology to High Energy Physics, Elba International Physics Center, Isola d'Elba, Italy, June 5-14, ETS Editrice, Pisa.
Vermeulen, J. 1992. A study of the feasibility of using neural networks for second level triggering at LHC. New Computing Techniques in Physics Research, II, D. Perret-Gallix, ed., World Scientific.
Wu, D., et al. 1990. CDF Internal Note 1310, A pattern recognition level-2 B trigger at CDF in 1991. CDF Collaboration, Fermi National Accelerator Laboratory, Batavia, Illinois; and private communication.

Received 5 August 1992; accepted 15 December 1992.
ARTICLE
Communicated by Christof Koch
Stimulus-Dependent Synchronization of Neuronal Assemblies

E. R. Grannan, D. Kleinfeld
AT&T Bell Laboratories, Murray Hill, NJ 07974 USA
H. Sompolinsky
Racah Institute of Physics and the Center for Neural Computation, Hebrew University, Jerusalem, 91904 Israel, and AT&T Bell Laboratories, Murray Hill, NJ 07974 USA
We study theoretically how an interaction between assemblies of neuronal oscillators can be modulated by the pattern of external stimuli. It is shown that spatial variations in the stimuli can control the magnitude and phase of the synchronization between the output of neurons with different receptive fields. This modulation emerges from cooperative dynamics in the network, without the need for specialized, activity-dependent synapses. Our results further suggest that the modulation of neuronal interactions by extended features of a stimulus may give rise to complex spatiotemporal fluctuations in the phases of neuronal oscillations.

1 Introduction

A ubiquitous feature of the brain is the presence of widespread, rhythmic patterns of neuronal activity (Ketchum and Haberly 1991). One aspect of this activity is gamma oscillations in the visual cortex, with a frequency near 40 Hz, that are evoked by an external stimulus (Bouyer et al. 1981; Freeman and van Dijk 1987; articles in Schuster 1991). Singer, Gray, and co-workers (Gray et al. 1989; Engel et al. 1991a,b) and Eckhorn et al. (1988) showed that the timing of these oscillations in a region of cortex can, under certain circumstances, be influenced by visual stimuli that lie outside its receptive field. In particular, the oscillations in regions with nonoverlapping receptive fields are synchronized when the direction of motion and orientation of stimuli presented to the individual fields are similar. Conversely, dissimilarity in these features results in a failure to synchronize. These results suggest that temporal coherence may be used to encode features of objects in multiple receptive fields.

*Present address: Department of Physics, Physical Sciences 2, University of California, Irvine, CA 92717.
Neural Computation 5, 550-569 (1993) © 1993 Massachusetts Institute of Technology
Temporal synchrony across the visual cortex (Engel et al. 1991a) is most likely mediated by long-range axonal projections within the cortex. These long-range axonal projections appear to connect neurons with different receptive fields but similar orientation preference (Gilbert and Wiesel 1989). However, the mechanism by which the stimulus gates the influence of the long-range connections, and thereby modulates the synchrony of the oscillations, is unclear. Here we consider theoretically how spatial variations in an extended stimulus can modulate the interactions in a network of coupled neuronal assemblies. Each assembly consists of analog neurons that are extensively interconnected and produce oscillatory output as a result of inhibitory feedback. Neurons in different assemblies are coupled by relatively weak, feature-specific excitatory connections. The strengths of all connections are fixed. Our work was motivated by the results of computer simulations (Sporns et al. 1989; Schillen and Konig 1991; Konig and Schillen 1991; Wilson and Bower 1991) and analytical studies (Schuster and Wagner 1991a,b; see also Aertsen et al. 1989) that suggest that the magnitude and possibly the phase of the interaction between assemblies can be modulated by the stimulus. Other studies, however, suggested that these effects are insufficiently strong (Sporns et al. 1991). Stimulus-dependent synchronizing interactions were postulated ad hoc in a previous theoretical study of a network of phase oscillators (Sompolinsky et al. 1990, 1991). It was shown that the emergent synchronization in the network was modulated by the extended properties of the stimulus in a manner similar to that found in experiments. In the present study we derive the form of the synchronizing interactions between the assemblies, and their dependence on spatial variations in the stimulus, from the full dynamics of the network.

2 Model
Our model describes the cooperative behavior of a network of weakly coupled clusters of neurons. Each cluster is analogous to a hypercolumn in primary visual cortex (e.g., Douglas and Martin 1990). It consists of neurons that respond to the presence of a stimulus within its receptive field. For simplicity, we limit ourselves to neurons whose response depends only on a single feature of the stimulus, namely the orientation of an edge. The time-averaged value of this response has a pronounced peak at a particular orientation, referred to as the preferred orientation of the neuron. We assume that the preferred orientations are uniformly distributed among different neurons within each cluster. Each cluster contains two types of neurons. One type, the excitatory cells, makes only excitatory connections on its postsynaptic targets, while the second, the inhibitory cells, makes only inhibitory connections. In our architecture, all
Figure 1: Schematic of the architecture of a network with two clusters. The open circles represent neurons that form only excitatory connections and filled circles represent neurons that make solely inhibitory connections. Only a representative fraction of the total number of connections is drawn.

of the inhibitory neurons are equivalent and can be replaced by a "global" inhibitory cell. There are extensive connections between neurons in the same cluster but only sparse connections between neurons in different clusters. The architecture of a network with two clusters is shown in Figure 1. The dynamics of the network is described by circuit equations (Wilson and Cowan 1972; Amari 1972; Hopfield 1984):
v̇_R(θ, t) = −v_R(θ, t) + (J_E/N) Σ_{θ'} V_R(θ', t) + J_I U_R(t) + ε Σ_{R'≠R} K(R − R') V_{R'}(θ, t) + I_R(θ) + ξ_R(t)

u̇_R(t) = −u_R(t) + (γ_E/N) Σ_{θ'} V_R(θ', t)   (2.1)
where v_R(θ, t) and V_R(θ, t) are the potential and output, respectively, of the excitatory neuron with orientation preference θ in the Rth cluster, N is the number of excitatory neurons in a cluster, u_R(t) and U_R(t) are the potential and output of the inhibitory neuron, and time, t, is normalized by the neuronal time constant. The output of each neuron may be interpreted as its instantaneous rate of firing. It depends on the value of the potential through a nondecreasing function, which we assume is the same for all neurons:
V_R(θ, t) = g[v_R(θ, t)]   and   U_R(t) = g[u_R(t)]   (2.2)
The form of g(x) is taken to be the logistic function g(x) = 1/{1 + exp[−4β(x − x0)]}, where β and x0 correspond to the gain and threshold parameters of the neurons, respectively. The parameters J_E and J_I denote the strength of the synapses from excitatory and inhibitory presynaptic neurons, respectively, to excitatory postsynaptic neurons within the same cluster. The parameter γ_E denotes the strength of the synapses from excitatory neurons to the inhibitory one. We have assumed that these parameters do not depend on the orientation preference of the pre- and postsynaptic neurons. On the other hand, synapses between the excitatory neurons of two different clusters occur only between neurons with similar orientation preference. These synapses have strength ε and their spatial dependence is specified by the function K(R − R') = K(|R − R'|). The external input consists of two components. A time-independent part, I_R(θ), encodes the orientation of a single edge within the receptive field of the Rth cluster. It is of the form

I_R(θ) = I(|θ − θ0(R)|)   (2.3)
where θ0(R) denotes the orientation of the stimulus and both θ and θ0 lie between 0 and π. For simplicity, only the excitatory neurons are taken to have external input. Temporal fluctuations in the input are denoted by a noise term, ξ_R(t). We have assumed that the noise is uniform within each cluster but varies between clusters (Sompolinsky et al. 1991). This noise competes with the interactions between different clusters and tends to destroy their relative synchrony. It is taken to be a gaussian variable with zero mean and variance

⟨ξ_R(t) ξ_{R'}(t')⟩ = 2ε δ_{RR'} T δ(t − t')   (2.4)
where T is the strength of the noise relative to that of the intercluster connections. A basic assumption in our model is that the neurons within a cluster interact strongly with each other while the interaction between neurons in different clusters is weak. This implies that the value of ε is small compared to that of J_E, J_I, and γ_E. Under this condition, we expect that the activity of a neuron is determined primarily by its connections to
neurons in the same cluster and by the stimulus within its receptive field. The dominant effect of the connections between clusters is to modulate the synchrony, that is, the relative phase, between neuronal activities in separate clusters. In our analysis of the model, we consider first the dynamics of a single cluster and ignore the interactions between clusters as well as the noise. We then derive the effective interaction between pairs of clusters, including the effects of noise, that results from their long-range connections.

3 Single Cluster Dynamics
The equations that describe a single cluster are (the label R is suppressed)
v̇(θ, t) = −v(θ, t) + J_E V(t) + J_I U(t) + I(θ)

u̇(t) = −u(t) + γ_E V(t)   (3.1)
where V(t) = ⟨V(θ, t)⟩_θ is the excitatory output averaged over all orientations, with ⟨···⟩_θ = ∫₀^π dθ/π. It is also useful to define the average value of the excitatory potential, v(t) = ⟨v(θ, t)⟩_θ, and the average value of the external stimulus, Ī = ⟨I(θ)⟩_θ. The potential of each excitatory neuron depends explicitly on its orientation preference only through the external stimulus (equation 3.1). Hence, neglecting transients, it can be expressed by
v(θ, t) = v(t) + I(θ) − Ī   (3.2)
The mean output of the excitatory neurons can then be related to the mean excitatory potential by an instantaneous gain function, that is, V(t) = G[v(t)], where

G(v) = ⟨g(v + I(θ) − Ī)⟩_θ   (3.3)
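The orientation average in equation 3.3 is straightforward to evaluate numerically. The sketch below assumes a logistic single-neuron gain and a triangular stimulus profile; both are illustrative stand-ins rather than values fixed by the text at this point.

```python
import numpy as np

# Effective gain G(v) = <g(v + I(theta) - Ibar)>_theta (equation 3.3),
# computed by averaging the single-neuron gain over orientation offsets.
def g(x, beta=3.0, x0=1.1):
    # Logistic gain with gain parameter beta and threshold x0.
    return 1.0 / (1.0 + np.exp(-4.0 * beta * (x - x0)))

def effective_gain(v, I_low=-2.0, I_high=1.5, n=1001):
    dtheta = np.linspace(-np.pi / 2, np.pi / 2, n)     # theta - theta0
    # Assumed triangular stimulus profile peaked at the stimulus orientation.
    I = I_high - (I_high - I_low) * np.abs(dtheta) / (np.pi / 2)
    return g(v + I - I.mean()).mean()
```

Like g itself, G is nondecreasing and bounded in [0, 1], but the orientation average smooths the sharp threshold of the single-neuron gain.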
The equations for the average potentials v(t) and u(t) are found by averaging the equations for the cluster over all orientations (equation 3.1):

v̇(t) = −v(t) + J_E G[v(t)] + J_I g[u(t)] + Ī

u̇(t) = −u(t) + γ_E G[v(t)]   (3.4)

The dynamics of an isolated cluster is particularly simple in that all of the excitatory neurons have the same input except for a stationary, external contribution that depends on the orientation preference of the neuron (equation 3.2). Figure 2a shows the state diagram for the output (equation 3.4) in terms of the values of two parameters, the inhibitory synaptic strength J_I and the average external input Ī, with all other parameters held constant. The value of the neuronal gain parameter β was chosen to be large. In this limit the stable fixed points, whenever they exist, correspond to either
an "OFF" state or an "ON" state. In the "ON" state all of the neurons are firing near their saturation level, that is, V(θ, t) = V(θ) ≈ 1, while in the "OFF" state all of the neurons are essentially quiescent. In the region marked "ON + OFF", the network is stable in either state and the behavior depends on the initial condition. In the "OSC" region, almost all initial conditions will lead to oscillatory outputs, while in the "OSC + ON" region, depending on the initial condition, the outputs will either oscillate or remain constantly active. An example of the output for an oscillatory state is displayed in Figure 2b. All the neurons oscillate with the same frequency, as implied by equation 3.2, but their average firing rates differ. The excitatory neurons with the greatest external input, that is, θ ≈ θ0, are the first to fire within a cycle. They are followed by the neurons with weaker external input and by the inhibitory neuron. The output of the inhibitory neuron gradually quenches the activity in the network until the external input again charges the excitatory neurons to a potential above their threshold level. Then the cycle begins anew. An important characteristic of a cluster is the tuning curve, that is, the average firing rate of an excitatory neuron, V(θ) = ⟨V(θ, t)⟩_t, where ⟨···⟩_t denotes an average over time. This quantity is a function of the difference between θ and the orientation of the stimulus, θ0. An evaluation of V(θ) for different values of θ yields the tuning curve shown in Figure 2c. For the particular form of the input we chose, the activity of the neuron is essentially zero for |θ − θ0| > 40°.
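The single-cluster behavior described above can be reproduced qualitatively with a few lines of simulation. This is a sketch, not the authors' code: the Euler step, the triangular stimulus profile, and the parameter values (taken loosely from the legend of Figure 2a) are illustrative assumptions.

```python
import numpy as np

# One cluster of equations 2.1-2.4 with epsilon = 0 and no noise:
# N excitatory potentials v(theta, t) and a single global inhibitory
# unit u(t), with logistic gain g(x) = 1/(1 + exp(-4*beta*(x - x0))).
def g(x, beta=3.0, x0=1.1):
    return 1.0 / (1.0 + np.exp(-4.0 * beta * (x - x0)))

def simulate_cluster(J_E=15.0, J_I=-7.0, gamma_E=12.0, N=60,
                     I_low=-2.0, I_high=1.5, steps=4000, dt=0.01):
    dtheta = np.linspace(-np.pi / 2, np.pi / 2, N)     # theta - theta0
    # Assumed triangular stimulus profile peaked at the stimulus orientation.
    I = I_high - (I_high - I_low) * np.abs(dtheta) / (np.pi / 2)
    v, u = np.zeros(N), 0.0
    rates = np.empty((steps, N))                       # outputs V(theta, t)
    for k in range(steps):
        V, U = g(v), g(u)
        v = v + dt * (-v + J_E * V.mean() + J_I * U + I)
        u = u + dt * (-u + gamma_E * V.mean())
        rates[k] = V
    return rates

rates = simulate_cluster()
tuning = rates[2000:].mean(axis=0)   # time-averaged output, cf. Figure 2c
```

Averaging the output over time after a transient gives a tuning curve peaked at the stimulus orientation, as in Figure 2c; which regime of Figure 2a the cluster lands in depends on J_I and the mean input.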
4 Phase Description of Interacting Clusters

4.1 Phase Equations. The dynamics of a network of interacting clusters can be greatly simplified in the limit that the intercluster coupling strength is small, that is, ε ≪ 1. First, equation 2.1 can be reduced to a set of closed mean-field equations that involve only the average excitatory and inhibitory potentials of the clusters, v_R(t) and u_R(t) respectively, and the average input, Ī (Appendix A). Second, it can be shown (Winfree 1980; Kuramoto 1984) that the average potentials are of the form

v_R(t) = v̄[ωt + ψ_R(t)],   u_R(t) = ū[ωt + ψ_R(t)]   (4.1)

where v̄(t) and ū(t) are the limit cycle solutions for the unperturbed cluster (equation 3.4), ω is the frequency of the neuronal oscillation, and ψ_R(t) is the phase of the oscillation. In the absence of intercluster couplings, each of the ψ_R(t) is an arbitrary constant whose value lies between 0 and 2π. The presence of a small coupling between the clusters induces temporal variations in the phases that are slow compared with the period
of the unperturbed limit cycle, that is, ψ̇_R(t) ∼ O(ε). These variations are described by phase equations of the form

ε⁻¹ ψ̇_R(t) = − Σ_{R'≠R} Γ_{RR'}(ψ_R − ψ_{R'}) + η_R(t)   (4.2)
where Γ_{RR'}(ψ_R − ψ_{R'}) represents the pair-wise interaction between the phases ψ_R(t) and ψ_{R'}(t), and η_R(t) is a gaussian noise that originates from ξ_R(t) and has a variance with an amplitude of ≈ T (see equation 2.4). The dynamics of the phases depends on the intercluster interaction Γ_{RR'}(ψ_R − ψ_{R'}).

4.2 The Form of the Interaction between Phases. The interaction Γ_{RR'}(ψ_R − ψ_{R'}) is a periodic function, with period 2π/ω. Its form depends on the structures of the unperturbed limit cycles of the Rth and R'th clusters. In our model the clusters are identical except for the orientation of their respective stimuli, θ0(R) and θ0(R'). Thus Γ_{RR'}(ψ_R − ψ_{R'}) depends on R and R' only through the relative orientation Δθ0 = |θ0(R) − θ0(R')|. This implies that Γ_{RR'}(ψ_R − ψ_{R'}) = Γ(Δψ; Δθ0), where Δψ = ψ_R − ψ_{R'}. We have derived numerically, using the method of Kuramoto (1984), the form of Γ(Δψ; Δθ0) from the mean-field equations for the potentials of the clusters (Appendix A). The results are shown in Figure 3a. Several features of Γ(Δψ; Δθ0) are apparent. First, it vanishes for Δθ0 > θ_c, where
Figure 2: Facing page. Aspects of the dynamics of a single cluster. (a) A state diagram of the output of the network. The fixed parameters for the network were J_E = 15, γ_E = 12, x0 = 1.1, β = 3, T = 0, and ε = 0, and the stimulus profile was I(θ − θ0) = I_l + (I_h − I_l)[1 − |θ − θ0|/(π/2)] with I_h − I_l = 3.5. The values of J_I and the average input Ī = (I_h + I_l)/2 were varied. The boundaries of existence of the different states were determined from numerical simulations of equations 3.3 and 3.4. However, since β is large, the boundaries for the "ON" or "OFF" states are approximately straight lines that can be determined from equation 3.1 by a consistency analysis. In an "OFF" state the maximum potential of the excitatory neurons must be less than x0. Since v(θ) = I(θ) in this state, we require maximum[I(θ)] < x0, which yields the vertical line. In the "ON" state the minimal potential has to be larger than x0. In this state v(θ) = J_E + J_I + I(θ), u̇ = 0, and V = U = 1, which leads to J_E + J_I + minimum[I(θ)] ≥ x0 and yields the oblique line. The asterisk corresponds to the values J_I = −7 and Ī = −0.25 used in the simulations for b-c. (b) The average firing rate of two excitatory neurons (upper panel) and the inhibitory neuron (lower panel) found from a simulation of the equations (equation 2.1) for a network with 60 neurons using the above parameters, except that we include noise of amplitude T = 0.0006 (equation 2.4). We chose the initial conditions such that the neurons did not get stuck in an "ON" state. The heavy line for the excitatory neurons refers to one with an orientation preference θ − θ0 = 3° while the thin line refers to one with θ − θ0 = 27°. The period is 2π/ω = 3.4. (c) The tuning curve, or time-averaged output of an excitatory neuron as a function of its orientation preference relative to the orientation of the stimulus. The average was calculated from simulations of the network with an averaging time of approximately 20 periods. The dots indicate the orientation preferences of the excitatory neurons featured in b.
θ_c is the full extent of the tuning curve (θ_c = 80° in Figure 2c). This can be understood by recalling that only neurons with the same orientation preference are connected by the intercluster couplings. Thus for Δθ0 > θ_c there are no pairs of active neurons that have the same orientation preference, and the effective interaction between the clusters must vanish. Second, Γ(Δψ; Δθ0) is not monotonic in Δθ0. Third, Γ(Δψ; Δθ0) is not an odd function of Δψ. This implies that the phase equations cannot be described in terms of a potential, that is, the interaction terms in equation 4.2 cannot be written in the form δW/δψ_R. This has important consequences for the dynamics of a network with many stimulated clusters (Discussion). The form of Γ(Δψ; Δθ0) indicates that it contains significant contributions from high harmonics in Δψ (Figure 3a). We find that a good approximation is

Γ(Δψ; Δθ0) ≈ Γ0 + Γ1 sin(Δψ + α1) + Γ2 sin(2Δψ + α2)   (4.3)
as illustrated for Δθ0 = 15° in Figure 3a. The zeroth harmonic, with amplitude Γ0, represents the shift in the period of the oscillations and is presently irrelevant. The amplitudes Γ1 and Γ2 decrease monotonically with Δθ0 and are zero for Δθ0 > θ_c (Fig. 3b). An unexpected result is the presence of large phase parameters, α1 and α2. They are nonzero even at Δθ0 = 0° and increase with increasing values of Δθ0. The nonzero value of the phase parameters appears to originate from the inhibitory feedback within each cluster. Specifically, the long-range excitatory synaptic input to weakly active neurons, that is, those stimulated away from their preferred orientation, increases the activity of these neurons and, in turn, increases the activity of the inhibitory neuron. This indirect activation of the inhibitory neuron contributes an "inhibitory" component to the interaction between the clusters. For sufficiently large values of Δθ0, all connections between a pair of clusters involve weakly active neurons and the inhibitory contributions dominate Γ(Δψ; Δθ0).

4.3 Dynamics of Two Interacting Clusters. We consider in detail the case of only two interacting clusters, for which one can subtract the equation for ψ_{R'}(t) from that for ψ_R(t) (equation 4.2) to obtain an equation for the phase difference Δψ(t):

Δψ̇(t) = −Γ̃(Δψ; Δθ0) + η̃(t)   (4.4)

where Γ̃(Δψ; Δθ0) = Γ(Δψ; Δθ0) − Γ(−Δψ; Δθ0) and η̃(t) = η_R(t) − η_{R'}(t). In the absence of noise, the steady-state solution of equation 4.4 is a fixed point and the phase difference between the two clusters will approach a constant, Δψ0. The value of Δψ0 is determined from the constraints Γ̃(Δψ0; Δθ0) = 0 and ∂Γ̃(Δψ0; Δθ0)/∂(Δψ) > 0. Note that Γ̃(Δψ; Δθ0)
Figure 3: The interaction in a network of weakly coupled clusters. (a) The form of the effective long-range interaction between pairs of phase variables as a function of the relative difference in their phase (equation 4.2). Each curve corresponds to a different value of Δθ0, the relative orientation of the stimuli. Note that the curves for Δθ0 = 75° and 90° are essentially flat. The open symbols correspond to the form of Γ(Δψ; 15°) given by its first three harmonics (equation 4.3). (b) The dependence of the amplitude and phase (radians) parameters for the first two harmonics of the interaction (a) on Δθ0 (equation 4.3).
is an odd function of Δψ. Thus Δψ = 0 and π are always zeros of Γ̃(Δψ; Δθ0), although they are not necessarily stable solutions. Further, the fixed points form degenerate pairs, ±Δψ0, that, except for the cases Δψ0 = 0 or π, correspond to two different states. The shape of Γ̃(Δψ; Δθ0) is shown in Figure 4a for several values of Δθ0. For small values the stable state is Δψ0 = 0. As Δθ0 is increased (beyond 6° for our parameters) the fixed point moves to a nonzero, intermediate value of Δψ0. As Δθ0 is further increased (beyond 36°) the stable fixed point becomes Δψ0 = π and remains so until the force vanishes (beyond θ_c). This behavior can be qualitatively understood by approximating Γ̃(Δψ; Δθ0) in terms of its first two harmonics (equation 4.3), that is, Γ̃(Δψ; Δθ0) ≈ 2Γ1 cos α1 sin(Δψ) + 2Γ2 cos α2 sin(2Δψ). When the first harmonic dominates the interaction, as occurs when the value of α1 is substantially smaller than π/2 and Γ1 is substantially larger than Γ2, the phase difference is zero. This situation corresponds to small values of Δθ0 (Fig. 3b and c). Similarly, a value of α1 near π leads to a phase difference of π, as occurs for large values of Δθ0. When the value of α1 is near π/2, corresponding to intermediate values of Δθ0, the contribution from the first harmonic is of the same magnitude as that from the second. This gives rise to the pronounced anharmonic shape of Γ̃(Δψ; Δθ0) (Fig. 4a) and to an intermediate phase shift Δψ0 ≈ cos⁻¹(−Γ1 cos α1 / 2Γ2 cos α2). In the presence of noise, η̃(t) in equation 4.4, the phase difference Δψ fluctuates in time rather than approaching a fixed value. The average phase coherence between the two clusters can be expressed by the intercluster correlation function

C_{RR'}(τ) = ⟨δV_R(t + τ) δV_{R'}(t)⟩_t / [⟨δV_R(t)²⟩_t ⟨δV_{R'}(t)²⟩_t]^{1/2}   (4.5)

with δV_R(t) = V_R(t) − ⟨V_R(t)⟩_t and, as before, ⟨···⟩_t denotes an average over time. The correlation function can be calculated from the unperturbed limit cycle of a single cluster and the phase dynamics (Appendix B). Since the clusters are identical, and the interaction between them is symmetric, an extremum will always occur at τ = 0. The correlation functions for several values of Δθ0 are shown in Figure 4b-d.

Figure 4: Facing page. Aspects of the dynamics in a network with two clusters. (a) The force that acts on the difference between two clusters as a function of their relative phase difference (equation 4.2). Each curve corresponds to a different value of Δθ0; those for Δθ0 = 75° and 90° are essentially flat. (b-d) The intercluster correlation function of the phase difference between two clusters as a function of time and different values of Δθ0 (equations 4.5 and B.4). The thin line refers to a low level of noise, 1/T = 33, and the thick line refers to an intermediate level, 1/T = 3.3. The network was equilibrated for approximately 150 periods and the correlation functions were averaged over an additional 150 periods.

When both stimuli are aligned, that is, Δθ0 = 0°, the
correlation has a prominent peak at τ = 0 (Fig. 4b). As the relative angle between the stimuli is increased, C_{RR'}(τ) develops a double peak that reflects the fluctuation of the network between two stable intermediate phase shifts. The presence of these phase shifts also causes a minimum to occur at τ = 0. These features are seen at Δθ0 = 30°, for which
Δψ0 ≈ ±1.3 (Fig. 4c). Note that, in practice, the intermediate phase shift may show up as a single peak at either a positive or a negative value of τ if the activity of the network is averaged for only a short time. Lastly, the peak amplitude of C_{RR'}(τ) is not a monotonic function of Δθ0 (cf. Fig. 4b-e). Further, while noise suppresses the amplitude of C_{RR'}(τ) for any value of Δθ0, the suppression is greatest for intermediate values (cf. thick versus thin line in Fig. 4c). These features reflect the nonmonotonic behavior of Γ(Δψ; Δθ0) with respect to Δθ0 (Fig. 3a). We calculated the equal-time correlation coefficient, C_{RR'}(0), as a function of the relative orientation of two stimuli, Δθ0, and for two levels of noise, T (Fig. 5). The value of the coefficient rapidly decreases as a function of the relative orientation in either case. Beyond approximately the full-width at half-maximum of the tuning curve, 22° for our parameters (Fig. 2c), the coefficient becomes negative as a consequence of the substantial phase shifts that occur for large values of Δθ0. However, as the magnitude of the interaction is reduced for these angles, the corresponding magnitude of the coefficients is also significantly reduced, particularly at high levels of noise.

5 Discussion
Our main result is that a weak, fixed synaptic coupling between clusters of neurons can generate an effective interaction between the phases of their oscillatory responses that depends strongly on the distribution of activity within each cluster. Thus the interaction is sensitive to the dissimilarity of the external inputs that stimulate the clusters. This result implies that stimulus-dependent synchronizing connections, postulated ad hoc in a previous network model of phase oscillators (Sompolinsky et al. 1990, 1991), can originate from neuronal dynamics without the need to invoke mechanisms of fast synaptic modification. This conclusion is consistent with the results of Konig and Schillen (1991), who simulated a network with time-delayed connections, and with the initial reports of Sporns et al. (1989). Our phase description is strictly valid only in the limit of weak intercluster coupling. In practice, the results of numerical calculations of the full equations for the model (equation 2.1) indicate that the phase model qualitatively describes the dynamics of the clusters even when the synaptic input a neuron receives via intercluster connections is about 5% of its total input (ε = 0.02 J_E; data not shown). The time it takes to synchronize the output of two clusters from an initial, unsynchronized state is relatively short, about three cycles for this strength of interaction (inset to Fig. 5). In contrast to the ad hoc assumption in a previous work (Sompolinsky et al. 1990, 1991), the present analysis shows that dissimilarity in the external stimuli for each of the two clusters not only reduces the amplitude
Synchronization of Neuronal Assemblies
h
O
v
1.0
..h
7 A
I
U
563
h \
0.5
A
Ct-A--LA
ae, oo
A
0 v
i
0
a
0.5 TIME, t (Units of 2 r / w )
1
a
3
8 -0.5-I
I
I
RELATIVE ORIENTATION, Atlo
Figure 5: The equal-time intercluster correlation coefficient for the phase difference between two clusters as a function of Δθ0 (equations 4.5 and B.4 with τ = 0). This coefficient is a measure of the discrimination capability of the network. The thin line is for 1/T = 33, while the thick line is for 1/T = 3.3. The inset shows the amplitude of the coefficient during consecutive periods following the presentation of stimuli. Equations 2.1-2.4 were simulated numerically with the parameters used in the phase model (legend to Fig. 2a), 1/T = 3.3 and ε = 0.02 J_E. Each datum reflects an average over 64 random initial conditions of the network.

of their effective interaction but also induces a tendency to form phase shifts. When only two clusters are stimulated, the phase shifts appear in the intercluster correlation function (equation 4.5; Fig. 4c-e). Large differences in orientation between the two stimuli result in a phase shift of π (Fig. 4d and e). The phase shifts are less than π for intermediate differences in orientation and disappear for small differences. Our results with regard to the occurrence of phase shifts are in apparent contradiction to those of Schuster and Wagner (1990a). These authors studied the phase interaction between weakly coupled clusters of neurons and claim that significant phase shifts do not occur. In contrast to the present work, the clusters in the model of Schuster and Wagner (1990a) had uniform external inputs and, further, their analysis was restricted to parameters near a Hopf bifurcation where the nonlinearities in the dynamics are weak. Our results are consistent with the simulations of Schillen and Konig (1991), where phase shifts in the correlation between the outputs of two clusters are evident (see their Fig. 4). There is currently little experimental evidence for phase shifts among the oscillatory responses of neurons in visual cortex [but note Fig. 1g in Engel et al. (1991b)]. This is in apparent disagreement with the predictions of our model. One possibility is that the limit of weak, long-range coupling is inappropriate. Yet this limit is suggested by the experimental evidence on stimulus-dependent synchronization across visual cortex (Eckhorn et al. 1988; Gray et al. 1989). In brief, stimuli outside the receptive field of a neuron may affect the cross-correlogram between it and other cells, but these stimuli do not significantly perturb the magnitude or form of its autocorrelogram. This suggests that the effective interaction between distant neurons affects only their timing and not their rate of firing. A second possibility is that phase shifts are particular to our choice of local architecture (Fig. 1). The numerical studies of Konig and Schillen (1991) make use of an architecture with solely excitatory connections plus synaptic delays, rather than inhibitory feedback. As mentioned above, the output of different clusters in their model exhibits phase shifts. Further, Hansel et al. (1992) recently derived the form of the phase interaction between two Hodgkin-Huxley neurons. They show that shifts occur for a range of inputs with neurons coupled either by synapses or by electrotonic junctions.
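As a toy check of the fixed-point analysis of Section 4.3, one can integrate the noise-free phase-difference equation (equation 4.4) with the two-harmonic form of the interaction (equation 4.3). The amplitudes and phase parameters below are illustrative choices, not values derived from the circuit model.

```python
import numpy as np

# Odd part of the two-harmonic interaction:
# Gamma_tilde(dpsi) = 2*G1*cos(a1)*sin(dpsi) + 2*G2*cos(a2)*sin(2*dpsi).
def gamma_tilde(dpsi, G1, a1, G2, a2):
    return (2.0 * G1 * np.cos(a1) * np.sin(dpsi)
            + 2.0 * G2 * np.cos(a2) * np.sin(2.0 * dpsi))

def settle(dpsi, G1, a1, G2, a2, steps=40000, dt=0.01):
    # Euler integration of d(dpsi)/dt = -Gamma_tilde(dpsi), no noise.
    for _ in range(steps):
        dpsi -= dt * gamma_tilde(dpsi, G1, a1, G2, a2)
    return np.mod(dpsi, 2.0 * np.pi)

# alpha1 small: the first harmonic dominates and the phases lock at 0.
aligned = settle(0.5, G1=1.0, a1=0.2, G2=0.3, a2=0.2)
# alpha1 past pi/2 and cos(alpha2) < 0: the stable state is the
# intermediate shift dpsi0 = arccos(-G1*cos(a1) / (2*G2*cos(a2))).
intermediate = settle(0.5, G1=1.0, a1=2.0, G2=0.5, a2=2.5)
```

Setting Γ̃ = 0 with sin Δψ ≠ 0 gives cos Δψ0 = −Γ1 cos α1 / (2Γ2 cos α2); in this two-harmonic sketch the stability condition ∂Γ̃/∂(Δψ) > 0 at that root reduces to cos α2 < 0, so intermediate shifts require a large second phase parameter.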
Thus a body of evidence suggests that phase shifts are a generic feature of the interaction between weakly coupled neuronal oscillators. There are a number of experimental issues that relate to the observation of phase shifts. The fully averaged cross-correlogram is symmetric in the presence of shifts. However, the cross-correlogram is likely to appear asymmetric when the averaging is incomplete, so that only one of the two possible phases, Δψ = ±Δψ0 (Fig. 4a), dominates the interaction. Thus asymmetric cross-correlograms, traditionally interpreted as the signature of monosynaptic connections (Perkel et al. 1967), may in some cases reflect phase-shifted correlograms that have been averaged for too short a time. A second issue is that fluctuations in cortical activity may make shifts difficult to detect. The amplitude of the phase-shifted correlograms is expected to be reduced compared with correlograms without phase shifts (cf. Fig. 4a and c-e). This may significantly lower the signal-to-noise ratio of shifted cross-correlograms. However, even in the presence of noise, stimulus-dependent phase shifts should lead to a change in the shape of the cross-correlogram that depends on the form of the stimuli. Indeed, cross-correlograms whose shape depends on the orientation of the stimulus have been observed (Ts'o et al. 1986). Lastly, both noise and variations
in the intrinsic frequency of the oscillation will broaden the phase-shifted peaks in the correlogram. This may cause a shifted correlogram to appear as one with a relatively broad central peak. Such correlograms have been reported in recent studies (Nelson et al. 1992), although it is unclear if they result from the mechanism we propose. We suggest that the existence of phase shifts in the oscillatory part of neuronal responses to dissimilar stimuli deserves further experimental scrutiny. The presence of phase parameters can lead to dramatic changes in the phase dynamics (equation 4.2) when more than two clusters of neurons are stimulated. While the detailed behavior depends on the form of the intercluster interaction, Γ_{RR'}(ψ_R − ψ_{R'}), qualitative aspects of the behavior may be accounted for by the simplified model

ε⁻¹ ψ̇_R(t) = − Σ_{R'≠R} K(R − R') J(Δθ0) sin(ψ_R − ψ_{R'} + α(Δθ0)) + η_R(t)   (5.1)
Here Δθ0 = θ0(R) − θ0(R') is the relative orientation of the particular two stimuli that act on a pair of clusters. The interaction parameter, J(Δθ0), measures the average overlap of the activities in a pair of clusters. It decreases monotonically with increasing values of Δθ0 and vanishes for Δθ0 > θ_c, where θ_c is the full width of the tuning curve. Conversely, the phase parameter α(Δθ0) increases monotonically with Δθ0. As before, K(R − R') specifies the spatial extent of the long-range connections (equation 2.1), and η_R(t) is a gaussian noise (equation 4.2). The above model explicitly expresses the dependence of the amplitudes and phases of the interaction between the clusters on the spatial distribution of gradients in the orientation of the stimuli. When the phase parameters α(Δθ0) are zero, as assumed in a previous work (Sompolinsky et al. 1990, 1991), the network is unfrustrated. In the absence of noise the stimulated clusters will synchronize with zero phase shifts. In contrast, nonzero values of α(Δθ0) may induce substantial frustration in the network and lead to a stable state with a complicated pattern of phase shifts. Further, the dynamics of the network is not governed by an energy function and thus the phases may not converge to fixed values. In cases where the values of the phase parameters are large, such as when the stimulus contains sufficiently large spatial gradients, it is likely that the phases of each cluster, ψ_R(t), will fluctuate chaotically in time. The phase model proposed here (equation 5.1) is likely to have validity beyond the specific architecture and dynamics of the circuit in the present work (Fig. 1). In fact, the simulation results of other circuits proposed for the 40 Hz oscillations in visual cortex (Sporns et al. 1991; Buhmann and von der Malsburg 1991; Konig and Schillen 1991; Wilson and Bower 1991) can be interpreted by this phase model.
Thus, the model may provide a useful framework to probe the nature of spatiotemporal patterns of neuronal responses and their role in sensory processing.
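The contrast between the unfrustrated and the frustrated regime can be illustrated with a minimal numerical sketch of equation 5.1. The all-to-all coupling, the uniform value of α, the absence of noise, the Euler integrator, and all parameter values below are illustrative assumptions, not the circuit of the present work:

```python
import numpy as np

def simulate_phases(alpha, n=50, kj=1.0, dt=0.05, steps=1600, eps=1e-3, seed=0):
    """Euler integration of a bare-bones version of equation 5.1:
    dpsi_R/dt = -sum_{R'} K J sin(psi_R - psi_R' + alpha),
    with uniform all-to-all coupling K*J = kj/n and no noise."""
    rng = np.random.default_rng(seed)
    psi = eps * rng.standard_normal(n)      # start near the synchronized state
    for _ in range(steps):
        diff = psi[:, None] - psi[None, :]  # psi_R - psi_R'
        psi = psi - dt * (kj / n) * np.sin(diff + alpha).sum(axis=1)
    return psi

def coherence(psi):
    """Order parameter r = |<exp(i psi)>|; r close to 1 means synchrony."""
    return abs(np.exp(1j * psi).mean())

r_sync = coherence(simulate_phases(alpha=0.0))  # unfrustrated: clusters lock
r_frus = coherence(simulate_phases(alpha=2.0))  # alpha > pi/2: sync unstable
print(round(r_sync, 3), round(r_frus, 3))
```

With α = 0 the synchronized state is linearly stable (perturbations decay at rate proportional to cos α), so the coherence stays near 1; with α = 2 > π/2 the same state is unstable and the phases disperse, which is the frustration effect described above.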
566
E. R. Grannan, D. Kleinfeld, and H. Sompolinsky
Appendix A

Here we sketch our derivation of the phase description from the full dynamics of the network. The equations for the dynamics of the full network (equation 2.1) are first reduced to a set of equations that involve only the potentials U_R(t) and Ū_R(t), the input I, and the noise ξ_R(t) within the clusters. This is accomplished by averaging equation 2.1 over all orientations θ, so that
To close these equations, one must obtain a relationship between the average output V_R and the average potentials U_R(t) and Ū_R(t). Subtracting equation 2.1 from equation A.1 and expanding all terms to first order in ε yields
where
with g′(x) = dg/dx and δg(x) ≡ g(x) − ⟨g(x)⟩_θ. Substitution of equations A.2 and A.3 into equation A.1 gives
Synchronization of Neuronal Assemblies
567
where

Γ_RR′(V_R, Ū_R; t) = K(R − R′) G[U_R(t)] + Γ̃_RR′(V_R, Ū_R; t)   (A.5)

and G(x) is defined by equation 3.3. The intercluster interaction term in equation A.4 is nonlocal in time. However, for small values of ε the potentials can be approximated by V_R(t) = V̄[ωt + ψ_R(t)] and V_R′(t) = V̄[ωt + ψ_R′(t)], where we have used equation 4.4 and the fact that the phases vary slowly on the time scale of the period of the oscillations, 2π/ω. Substituting this form into equation A.4 results in equations that are now local in time. These equations represent a system of weakly coupled, two-dimensional limit cycles. By an appropriate average over the fast variables, they are further reduced to a set of equations (equation 4.2) that involves only the slow phase variables, ψ_R(t) (Fig. 3). For details of this reduction, see Kuramoto (1984).
Appendix B

Here we sketch our calculation of the intercluster correlation function (equation 4.5) in terms of the phase dynamics. The correlation can be expressed as
where ⟨···⟩_t denotes averaging over time and over the noise in the phase equations (equation 4.2), and δV̄_R[ωt + ψ_R(t)] = V̄_R[ωt + ψ_R(t)] − ⟨V_R(t)⟩_t, where V̄_R[ωt + ψ_R(t)] ≈ G{U_R[ωt + ψ_R(t)]} is the solution of the equations for the unperturbed cycle (equations 2.2, 3.3, and 3.4). If we restrict ourselves to values of τ that are on the order of ω⁻¹, we can make the approximation ψ_R′(t + τ) ≈ ψ_R′(t). In this limit C_RR′(τ) depends only on fluctuations in the phase difference Δψ(t) = ψ_R′(t) − ψ_R(t). For the case when only two clusters are stimulated, the form of equation 4.4 implies that the equilibrium distribution of the phase difference, a stochastic variable, is of a Gibbs form, that is,

D(Δψ) ∝ e^{−W(Δψ)/2T}   (B.2)
where the potential W(Δψ) is given in terms of the interaction Γ(Δψ; Δθ₀) (equation 4.4) by
W(Δψ) = ∫₀^{Δψ} dψ Γ(ψ; Δθ₀)   (B.3)
We thus arrive at
C_RR′(τ) = [∫ (dψ/2π) D(ψ) ∫₀^{2π/ω} (ω dt/2π) δV̄_R(ωt) δV̄_R′(ωt + ωτ + ψ)] / ⟨[δV̄_R(ωt)]²⟩_t   (B.4)

Note that this result for C_RR′(τ) is valid only for values τ ≈ O(ω⁻¹).
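The Gibbs form of D(Δψ) can be checked directly by simulating an overdamped Langevin dynamics for the phase difference. The specific interaction Γ(ψ) = J sin(ψ + α), the parameter values, and the Euler-Maruyama scheme below are illustrative assumptions; for this choice the stationary density e^{−W(Δψ)/2T} is a von Mises distribution centered at the minimum of W, that is, at Δψ = −α:

```python
import numpy as np

# Illustrative interaction Gamma(psi) = J*sin(psi + alpha), so that
# W(psi) = integral_0^psi Gamma = J*(cos(alpha) - cos(psi + alpha)),
# whose minimum sits at psi = -alpha (mod 2*pi).
J, alpha, T = 1.0, 0.5, 0.1

def gamma(psi):
    return J * np.sin(psi + alpha)

# Euler-Maruyama for d(psi)/dt = -Gamma(psi) + noise, with the noise
# variance chosen so that the stationary density is exp(-W(psi)/2T).
rng = np.random.default_rng(1)
dt, steps, burn = 0.01, 100_000, 10_000
psi, samples = 0.0, []
for t in range(steps):
    psi += -gamma(psi) * dt + np.sqrt(4 * T * dt) * rng.standard_normal()
    psi = (psi + np.pi) % (2 * np.pi) - np.pi   # wrap to (-pi, pi]
    if t >= burn:
        samples.append(psi)

# Circular mean of the sampled phase difference; it should sit
# near the minimum of W, i.e., near -alpha = -0.5
mean_dir = np.angle(np.exp(1j * np.array(samples)).mean())
print(round(mean_dir, 2))
```

The noise amplitude √(4T dt) corresponds to a diffusion constant 2T, which is what makes the stationary density ∝ e^{−W/2T} rather than the more familiar e^{−W/T}.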
Acknowledgments
We thank B. Friedman, C. D. Gilbert, D. Hansel, J. A. Hirsch, P. C. Hohenberg, P. König, O. Sporns, and D. Y. Ts'o for useful discussions. D. K. and H. S. thank the Aspen Center for Physics for its hospitality. This work was supported, in part, by the Fund for Basic Research administered by the Israeli Academy of Arts and Sciences and by the U.S.-Israel Binational Science Foundation.
References

Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., and Palm, G. 1989. Dynamics of neuronal firing correlation: Modulation of effective connectivity. J. Neurophysiol. 61, 900-917.
Amari, S.-I. 1972. Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Trans. Comp. 21, 1197-1206.
Bouyer, J. J., Montaron, M. F., and Rougeul, A. 1981. Fast fronto-parietal rhythms during combined focused attentive behaviour and immobility in cat: Cortical and thalamic localization. Electroenceph. Clin. Neurophysiol. 51, 244-252.
Buhmann, J., and von der Malsburg, Ch. 1991. Sensory segmentation by neural oscillators. In Proceedings of the International Conference on Neural Networks, Vol. II, pp. 603-607.
Douglas, R. J., and Martin, K. A. C. 1990. Neocortex. In Synaptic Organization of the Brain, 3rd ed., G. M. Shepherd, ed., pp. 356-438. Oxford University Press, New York.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. J. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Multiple electrode and correlation analysis in the cat. Biol. Cybern. 60, 121-130.
Engel, A. K., König, P., Kreiter, A. K., and Singer, W. 1991a. Interhemispheric synchronization of oscillatory neuronal responses in cat visual cortex. Science 252, 1177-1180.
Engel, A. K., Kreiter, A. K., König, P., and Singer, W. 1991b. Synchronization of oscillatory neuronal responses between striate and extrastriate visual cortical areas of the cat. Proc. Natl. Acad. Sci. U.S.A. 88, 6048-6052.
Freeman, W., and van Dijk, B. W. 1987. Spatial patterns of visual cortical fast EEG during conditioned reflex in a rhesus monkey. Brain Res. 422, 267-276.
Gilbert, C. D., and Wiesel, T. N. 1989. Columnar specificity of intrinsic horizontal and corticocortical connections in cat visual cortex. J. Neurosci. 9, 2432-2442.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Hansel, D., Mato, G., and Meunier, C. 1992. Phase dynamics for weakly coupled model neurons. CNRS preprint.
Hopfield, J. J. 1984. Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A. 81, 3088-3092.
Ketchum, K. L., and Haberly, L. B. 1991. Fast oscillations and dispersive propagation in olfactory cortex and other cortical areas: A functional hypothesis. In Olfaction: A Model System for Computational Neuroscience, J. L. Davis and H. Eichenbaum, eds., pp. 69-100. MIT Press, Cambridge, MA.
König, P., and Schillen, T. B. 1991. Stimulus-dependent assembly formation of oscillatory responses: I. Synchronization. Neural Comp. 3, 155-166.
Kuramoto, Y. 1984. Chemical Oscillations, Waves, and Turbulence. Springer-Verlag, New York.
Nelson, J. I., Salin, P. A., Munk, M. H., Arzi, M., and Bullier, J. 1992. Spatial and temporal coherence in cortico-cortical connections: A cross-correlation study in areas 17 and 18 in the cat. Visual Neurosci. 9, 21-37.
Perkel, D. H., Gerstein, G. L., and Moore, G. P. 1967. Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophys. J. 7, 419-440.
Schillen, T. B., and König, P. 1991. Stimulus-dependent assembly formation of oscillatory responses: II. Desynchronization. Neural Comp. 3, 167-178.
Schuster, H. G. 1991. Nonlinear Dynamics and Neuronal Networks: Proceedings of the 63rd W. E. Heraeus Seminar, Friedrichsdorf 1990. VCH, New York.
Schuster, H. G., and Wagner, P. 1990a. A model for neuronal oscillations in the visual cortex: I. Mean-field theory and derivation of the phase equations. Biol. Cybern. 64, 77-82.
Schuster, H. G., and Wagner, P. 1990b. A model for neuronal oscillations in the visual cortex: II. Phase description of the feature dependent synchronization. Biol. Cybern. 64, 83-85.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1990. Global processing of visual stimuli in a network of coupled oscillators. Proc. Natl. Acad. Sci. U.S.A. 87, 7200-7204.
Sompolinsky, H., Golomb, D., and Kleinfeld, D. 1991. Cooperative dynamics in visual processing. Phys. Rev. A 43, 6990-7011.
Sporns, O., Gally, J. A., Reeke, G. N., and Edelman, G. M. 1989. Reentrant signaling among simulated neuronal groups leads to coherency in their oscillatory activity. Proc. Natl. Acad. Sci. U.S.A. 86, 7265-7269.
Sporns, O., Tononi, G., and Edelman, G. M. 1991. Modeling perceptual grouping and figure-ground segregation by means of active reentrant connections. Proc. Natl. Acad. Sci. U.S.A. 88, 129-133.
Ts'o, D. Y., Gilbert, C. D., and Wiesel, T. N. 1986. Relationships between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J. Neurosci. 6, 1160-1170.
Wilson, M. A., and Bower, J. M. 1991. A computer simulation of oscillatory behavior in primary visual cortex. Neural Comp. 3, 498-509.
Wilson, H. R., and Cowan, J. D. 1972. Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J. 12, 1-24.
Winfree, A. T. 1980. The Geometry of Biological Time. Springer-Verlag, New York.

Received 4 February 1992; accepted 25 January 1993.
Communicated by Laurence Abbott
Dynamics of Populations of Integrate-and-Fire Neurons, Partial Synchronization and Memory

Marius Usher
Computation and Neural Systems 226-76, Caltech, Pasadena, CA 91125 USA

Heinz Georg Schuster
Institut für Theoretische Physik, Universität Kiel, D-2300 Kiel 2, Germany

Ernst Niebur
Computation and Neural Systems 226-76, Caltech, Pasadena, CA 91125 USA
We study the dynamics of completely connected populations of refractory integrate-and-fire neurons in the presence of noise. Solving the master equation based on a mean-field approach, and by computer simulations, we find sustained states of activity that correspond to fixed points and show that for the same value of external input the system has one or two attractors. The dynamic behavior of the population under the influence of external input and noise manifests hysteresis effects that might have a functional role for memory. The temporal dynamics at higher temporal resolution, finer than the transmission delay times and the refractory period, are characterized by synchronized activity of subpopulations. The global activity of the population shows aperiodic oscillations analogous to experimentally found field potentials.

1 Introduction
Most artificial neural networks are based on binary McCulloch-Pitts neurons that have no intrinsic temporal characteristics. As opposed to these simplified units, real neurons perform temporal integration over their inputs with some specific decay constant and have refractory periods. Model neurons satisfying these constraints are often called integrate-and-fire neurons. Although the dynamic behavior of a single such neuron is straightforward, their population dynamics are highly complex. Recently, several studies have shown that under specific assumptions analytical results characterizing such populations of neurons can be obtained. For example, Amit and Tsodyks (1992) have shown that on a longer time scale the dynamic behavior of integrate-and-fire neurons can be averaged out and characterized by continuous variables (firing rates or currents). However, this transformation is based on the assumption that there is no synchronicity among inputs, and once made, the question of dynamics at higher temporal resolution cannot be addressed. Gerstner and Van Hemmen (1992) proposed a stochastic model for spiking neurons that incorporates refractory periods and transmission delays, but which disregards membrane decay properties. The issue of temporal modulations has been investigated by Mirollo and Strogatz (1990), who showed that a population of integrate-and-fire neurons in a strong external field (receiving a strong external input and acting as pulse oscillators) will phase-lock and reach synchronicity if they have excitatory all-to-all connections. Van Vreeswijk and Abbott (1992) analyzed the behavior of completely connected populations of excitatory neurons and found that even in the absence of external input, such populations can exhibit sustained activity due to self-interaction. In particular, they showed that, in the absence of noise, the population may lock into one of several patterns of cyclic activity. The purpose of this work is to further analyze the dynamic characteristics of such populations in the presence of noise and small external inputs, and to find the specific way that a population responds to different inputs. In the first part we develop a discrete-time mean-field model, using a master equation that takes explicitly into account the stochastic behavior due to noise. Using the master equation and numerical simulations we show that for some values of the external field the population is dominated by two attractors, a self-sustained activity state and a silent state. These states are fixed points of the dynamics, and we discuss their basins of attraction.

Neural Computation 5, 570-586 (1993) © 1993 Massachusetts Institute of Technology
We also show that the network exhibits hysteresis and thus can function as a memory system that reacts to external inputs in a different way than systems consisting of McCulloch-Pitts neurons. In the last section, we study the temporal dynamics at a finer temporal resolution. Taking explicitly into account axonal and synaptic delays and refractory periods, we obtain synchronized activity of subpopulations and aperiodic oscillations of the whole population's activity.
2 The Model
Consider a fully connected population of N excitatory neurons, each one characterized by a continuous variable z_i that represents the cell's potential (1 ≤ i ≤ N). Each neuron integrates over its inputs, and once it reaches a threshold (chosen without loss of generality to be 1), the neuron fires, sends its output to the other neurons in the population, and resets its potential to zero. For simplicity, we will use a synchronous updating rule where each iteration step corresponds to a typical time for the spike dynamics (including the refractory period) and transmission delays of
about 2-3 msec. Thus, for neuron i,

z_i(t + 1) = [λ z_i(t) + (J/N) Σ_j Θ(z_j(t) − 1) + h + ξ_i(t)] Θ[1 − z_i(t)]   (2.1)

where λ is a decay constant (0 ≤ λ ≤ 1), J > 0 is the coupling constant, Θ is the unit step function, and the external input is characterized by a mean value h and a gaussian noise term ξ_i (of zero mean and standard deviation σ). The first step function adds the contributions from other cells that are firing at time t, and the second step function enforces refractoriness (a neuron cannot fire at two consecutive time steps). In order to test the validity of our synchronous discretization scheme, we have to test whether its dynamic properties depend upon the length of the iteration steps (in the limit of infinitesimal time steps, our scheme should reduce to a continuous system of differential equations representing currents in the cells). As it stands, equation 2.1 implies that both the delay time and the refractory period are equal to the discretization step. Thus, in order to increase the temporal resolution (without changing the actual values of the biological parameters, i.e., refractoriness and delay time), equation 2.1 should be modified so that the refractory period and the delay time extend over more than one (now smaller) time bin. We shall show in the last section that the results obtained from equation 2.1 (at low time resolution) are compatible with those obtained from the modified delayed equation. However, at the higher resolution new phenomena are revealed. As shown in Van Vreeswijk and Abbott (1992), for zero external input and noise (h = ξ_i = 0), the population may either remain silent or segregate into a cyclic pattern of activity of length M. At time t a subgroup of n^t cells fires and subsequently, at time t + 1, a different subgroup of n^{t+1} cells will fire, etc., satisfying n^t + n^{t+1} + ⋯ + n^{t+M−1} = N. For λ = 1 the length of the cycle M is limited (Van Vreeswijk and Abbott 1992) by

JN/(JN − N) < M < 2JN/(JN − N)   (2.2)

(equivalently, J/(J − 1) < M < 2J/(J − 1)). Although the length of the cycle is limited by this equation, it does not determine the sequence of activations n^t, n^{t+1}, …, n^{t+M}, which depends on the initial conditions z_i(0). As we shall show in the next section, under the influence of noise the temporal behavior is asymptotically characterized by homogeneous sequences [i.e., n(t) = n(t + 1) = ⋯ = n(t + M)]. Thus the system converges to one of two possible fixed points, a homogeneous sequence of activations or the silent state. This situation is somewhat analogous to a ferromagnetic system.
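The noiseless cyclic modes and the bound of equation 2.2 can be illustrated with a direct simulation of equation 2.1. The explicit form of the update rule, the choice N = 100, J = 1.5, and the hand-picked initial condition with M = 4 equal subgroups are illustrative assumptions:

```python
import numpy as np

N, J, lam, h = 100, 1.5, 1.0, 0.0   # no decay (lambda = 1), no field, no noise

def step(z):
    """One synchronous update of equation 2.1 (noiseless case):
    fired cells (z >= 1) contribute J/N each and are then reset and
    held refractory for one step; the rest integrate their input."""
    fired = z >= 1.0
    m = fired.mean()
    return (lam * z + J * m + h) * (~fired), m

# Initial condition: M = 4 equal subgroups staggered in potential.
z = np.repeat([1.0, 0.75, 0.375, 0.0], N // 4)
ms = []
for _ in range(40):
    z, m = step(z)
    ms.append(m)

print(ms[:8])        # every entry is 0.25: a homogeneous cycle, m^t = 1/M
lower, upper = J / (J - 1), 2 * J / (J - 1)
print(lower, upper)  # the cycle length M = 4 lies between 3.0 and 6.0
```

Each step one quarter of the population fires and resets, while the other three quarters each gain J/4 = 0.375, so a cell crosses threshold exactly four steps after its last spike, consistent with the bound J/(J − 1) < M < 2J/(J − 1).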
3 Master Equation
The stochastic system in equation 2.1 is completely characterized by the full probability distribution P^t(z_1, z_2, …, z_N). Since our system is fully connected, a mean-field approach yields a lower-dimensional probability distribution P^t(z),

P^t(z) = (1/N) Σ_i ∫ δ(z − z_i) P^t(z_1, …, z_N) Π_j dz_j ≡ (1/N) Σ_i P_i^t(z)

where P_i^t(z) is the probability that neuron i will have the synaptic potential z at time t, and P^t(z) is the mean-field average probability of having a postsynaptic potential z in the system at time t. The time dependence of P^t(z) is obtained from equation 2.1:
P^{t+1}(z) = m^t δ(z) + ∫_{−∞}^{1} dz′ ⟨δ[z − (λz′ + Jm^t + h + ξ)]⟩ P^t(z′)   (3.1)

where m^t, the normalized fraction of active cells at time t, is related to P^t(z) via m^t = ∫_1^∞ P^t(z) dz, and ⟨·⟩ denotes the average over the noise probability distribution, which is chosen to be gaussian:

f(ξ) = (2πσ²)^{−1/2} exp(−ξ²/2σ²)   (3.2)

Inserting this into equation 3.1 we obtain

P^{t+1}(z) = m^t δ(z) + ∫_{−∞}^{1} dz′ f(z − λz′ − Jm^t − h) P^t(z′)   (3.3)

from which m^{t+1} can be calculated:

m^{t+1} = ∫_1^∞ dz P^{t+1}(z) = ∫_{−∞}^{1} dz′ ∫_1^∞ dz f(z − λz′ − Jm^t − h) P^t(z′)   (3.4)
3.1 Strong Decay. For λ = 0, the integral in equation 3.4 factorizes and we obtain a nonlinear one-dimensional map:

m^{t+1} = (1 − m^t) ∫_{1−Jm^t−h}^∞ f(ξ) dξ   (3.5)

Approximating the gaussian by f(ξ) = [2T cosh²(ξ/T)]^{−1} (with σ² = T²π²/12), we obtain

m^{t+1} = [(1 − m^t)/2] {1 + tanh[(Jm^t + h − 1)/T]}   (3.6)
The dynamic behavior of m^t is governed by the parameters J, h, and T. In Figure 1a-c (solid lines) we plot the term on the right-hand side of equation 3.6 together with the identity function, for T = 0.2, J = 1.5, and different values of the external field h. For a high external field h = 0.8 (Fig. 1a), the map has a single fixed point characterized by an activation value close to 0.5, which implies that for such high input the population segregates into two subgroups that fire consecutively (0.5 is the maximal
Figure 1: The one-dimensional map (equations 3.4 and 4.4) for J = 1.5, T = 0.2 and different external fields: (a) h = 0.8, equation 3.4 solid line, and equation 4.4 with n_d = n_r = 3 dashed line; (b) h = 0.6, equation 3.4 solid line, and equation 4.4 with n_d = n_r = 2 long-dashed line, n_d = n_r = 3 short-dashed line; (c) h = 0.4, equation 3.4; (d) trajectories beginning from different initial conditions (m_0 = 0.1 or m_0 = 0.7) for parameters corresponding to (b). Lines, solutions of equation 3.4; symbols, simulation with a population of N = 1000 neurons for the two initial conditions shown.
firing rate due to the refractoriness). The fixed point is stable, and thus the temporal sequence of activations will manifest damped oscillations toward it. For intermediate values of the external field, the system has two stable fixed points and an unstable one (the intermediate one). Therefore, for such external fields the system has two attractors and can function as a memory system. The domains of attraction of each fixed point are unlike those in a McCulloch-Pitts system; the initial states that are attracted toward the high m value have intermediate m(0), while very low and very high values, m(0) < 0.14 or m(0) > 0.8, lead to the decay of the activation. This is illustrated in Figure 1d, where we display the time sequence obtained from two different initial conditions using the one-dimensional map (lines) and a simulation with N = 1000 (symbols). A further decrease of the external field leads to the situation displayed in Figure 1c, where the activity always decays. For 0 < λ ≪ 1 a two-dimensional map can be obtained, using a first-order Taylor expansion of equations 3.3 and 3.4. For moderate values of λ (e.g., λ = 0.4), the system behaves qualitatively like a system with λ = 0; that is, it has a single attractor with high m for a large external field, two attractors for intermediate external fields, and again a single attractor (with small m) for small external fields. Due to the finite value of λ, the transition points between these regimes occur at smaller field values than for λ = 0 (e.g., for λ = 0.4, the system has two attractors already for h = 0.4; data not shown).
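The attractor structure just described can be reproduced by iterating the map of equation 3.6 directly (J = 1.5, T = 0.2, as in Figure 1; the iteration counts and starting points below are arbitrary choices for illustration):

```python
import math

def f(m, h, J=1.5, T=0.2):
    """Right-hand side of the map of equation 3.6 (lambda = 0)."""
    return 0.5 * (1.0 - m) * (1.0 + math.tanh((J * m + h - 1.0) / T))

def iterate(m0, h, n=500):
    m = m0
    for _ in range(n):
        m = f(m, h)
    return m

# High field h = 0.8: a single attractor near the maximal rate 0.5.
print(iterate(0.10, 0.8))
# Intermediate field h = 0.6: two attractors; intermediate starting
# points reach the active state, extreme ones decay (cf. Fig. 1d).
print(iterate(0.30, 0.6))   # active state, m near 0.5
print(iterate(0.05, 0.6))   # decays to the (nearly) silent state
print(iterate(0.90, 0.6))   # a too-high start also decays
# Low field h = 0.4: activity always decays.
print(iterate(0.50, 0.4))
```

Note that the "silent" attractor of the map is not exactly m = 0: the tail of the noise distribution sustains a small residual activation.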
3.2 Weak Decay. We have numerically solved the master equation for an initial distribution P⁰(z) corresponding to a fraction of active neurons m_0, uniformly distributed otherwise. For λ = 1, we found that the system always reaches a fixed point corresponding to a homogeneous sequence m^t = 1/M (where M is compatible with equation 2.2), even for vanishing external input. This is illustrated in Figure 2a where, beginning with a very high initial activation (which would lead to decay in the absence of noise), the system gradually recovers due to small gaussian noise (with σ corresponding to T = 0.05 in equation 3.6). This phenomenon is explained by the fact that noise causes diffusion of the probability distribution P(z), whose tail eventually reaches the threshold and turns the system on. Also due to noise, the system equalizes the sizes of the subgroups. Figure 2b shows the multimodal probability distribution (with four components corresponding to the M = 4 subgroups) reached after 20 iterations. The same stationary distribution is reached from all initial conditions. For λ < 1, the noise-induced diffusion is balanced by the decay, and the system is not able to self-activate for all initial conditions. As in the small-λ case, the system will reach one of two attractors, depending on the initial conditions. In Figure 2c we display the time sequence obtained for λ = 0.98, beginning with initial condition m_0 = 0.28, and
Figure 2: Solution m^t of the master equation, equation 3.4, for P⁰(z) = m_0 δ(z − 1) + (1 − m_0)[Θ(z) − Θ(z − 1)], J = 1.5, T = 0.05, h = 0. (a) m^t for λ = 1, m_0 = 0.44; (b) P^t(z) for t = 30; (c) m^t for λ = 0.98, m_0 = 0.28; (d) probability distribution after t = 120 iterations, parameters as in (c). In both cases, the system reaches a fixed point of self-sustained activity.
P⁰(z) uniform otherwise. We observe damped oscillations that reach a fixed point of m = 0.2, implying segregation into five equal subgroups. The corresponding multimodal probability distribution P(z) is displayed in Figure 2d. Noise tends to homogenize the initially nonhomogeneous sequence of activation by transporting probability from the larger to the smaller subgroups. Beginning with a "too high" activation m_0 = 0.44 and
Figure 3: P^t(z), same parameters as in Figure 2, beginning with a high initial activation m_0 = 0.44, P⁰(z) = m_0 δ(z − 1) + (1 − m_0)[Θ(z) − Θ(z − 1)]. (a) t = 3, (b) t = 9, (c) t = 120 iterations. The probability shrinks due to the decay, although, of course, ∫ dz P^t(z) = 1 for all t. After 5 iterations, m^t ≈ 0.
P⁰(z) uniform otherwise, the activation decays to zero. In Figure 3 we display the probability distribution P^t(z) at t = 3, t = 9, and t = 120. We observe that in spite of the diffusion, P^t(z) shrinks and the population remains inactive.
The influence of the noise on the system dynamics can be summarized as follows: For very low noise (small T), the system reaches a limit cycle in which it segregates into subpopulations of unequal sizes, in each of which all neurons fire synchronously. The length of the limit cycle depends on the initial conditions. When the noise exceeds a certain value, a transition toward a fixed point representing segregation into equal subpopulations occurs, and the dependence on initial conditions reduces to the choice between two attractors (the silent and the active state). The noise value at which the transition occurs depends on the number of subpopulations in the cycle; for J larger than, but near, 1, cycles of long length are obtained, as shown in Figure 4a. In this case the different subpopulations are separated by small potential differences, and very low noise will transport neurons between the different subpopulations and contribute to their homogenization (Fig. 4b). As J increases, the cycle length (and hence the number of subpopulations) decreases while the potential differences between them increase. Thus a higher noise level is required in order to transport neurons among the populations. A systematic simulation study of the J dependency shows the following:

- For J = 1.9, λ = 1, a cycle of order 3 is obtained in the absence of noise, and the temperature required for inducing the transition is T = 0.07.
- For J = 1.5, λ = 1, a cycle of order 4 is obtained, and the temperature required for inducing the transition was T = 0.04.
- For J = 1.1, a cycle of length 14 was obtained, and a noise characterized by T = 0.01 was enough to induce the transition (Fig. 4b).
3.3 Inputs and Hysteresis. There are two ways by which a neural population can receive external inputs. The first is via the initial state m_0 (i.e., a subgroup of the population is turned on), and the second is by a sustained external input (in the form of postsynaptic potentials) that is modeled as an external field h. For h = 0, we have shown that the system has two attractors. This leads to the hysteresis phenomenon illustrated in Figure 5, where we show the average activation of the population as we increase (or decrease) the external field adiabatically. We observe that for h = 0 the activation may take two values, depending on history. Therefore, absence of excitation (or presence of small inhibition) will leave the population inactive. However, once positive excitation has reached the assembly, causing it to fire, the population "remembers" and continues self-sustained firing even after the external input no longer reaches the population.
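A hysteresis loop of this kind can be sketched even with the λ = 0 map of equation 3.6, although there the bistable window sits at intermediate fields rather than around h = 0 as in Figure 5 (which was obtained from the master equation with λ = 0.98). The sweep range, step size, and iteration counts below are illustrative assumptions:

```python
import math

def f(m, h, J=1.5, T=0.2):
    """Map of equation 3.6 (lambda = 0)."""
    return 0.5 * (1.0 - m) * (1.0 + math.tanh((J * m + h - 1.0) / T))

def sweep(h_values, m0, iters=400):
    """Adiabatic sweep: at each field value the map is iterated to its
    current attractor, carrying the activation over to the next field."""
    m, trace = m0, []
    for h in h_values:
        for _ in range(iters):
            m = f(m, h)
        trace.append(m)
    return trace

h_up = [0.40 + 0.005 * k for k in range(71)]  # 0.40 -> 0.75
h_down = list(reversed(h_up))                 # 0.75 -> 0.40
up = sweep(h_up, m0=0.0)                      # start on the silent branch
down = sweep(h_down, m0=0.5)                  # start on the active branch

i = 30   # h_up[30] = 0.55, probed on both branches (index 70 - i going down)
print(h_up[i], up[i], down[70 - i])
```

At the same field value the up-sweep stays on the (nearly) silent branch while the down-sweep stays on the active branch, which is the history dependence described above.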
Figure 4: Long cycles for J = 1.1. (a) Without noise (T = 0), a cycle of length 14 is obtained; (b) a small level of noise (T = 0.01) leads to stochastic fluctuations around a homogeneous steady state.

4 Dynamics at Higher Time Resolution; Synchronicity and Field Potentials
As we have already mentioned, equation 2.1 has to be modified in order to investigate the population dynamics on a time scale characterized by a finer discretization. This is accomplished by considering a bin size that equals a fraction, say 1/n, of the characteristic delay time and refractory
Figure 5: Hysteresis. The average m^t is plotted as a function of the adiabatic field h. (The field was increased by 0.01 every 20 iterations, and the average activation m was plotted.) Lines represent numerical solutions of the master equation 3.4; symbols represent simulations of a population of N = 1000 cells. (a) Low noise (T = 0.05). Note the jumps caused by quantal transitions between cyclic patterns of length M = 4 and M = 3. (b) Same for a higher noise level, T = 0.2. Parameters λ = 0.98, J = 1.5 for (a) and (b).
period. We assume in the following that all pairs of cells are characterized by a single communication delay time (composed of axonal and synaptic delay), but we shall discuss the case when the delay differs from the refractory time. Each neuron integrates over its inputs, and once it
reaches the threshold, the neuron fires, sending its output to the other neurons in the population, where it arrives after a delay of length n_d. After emitting the spike the neuron is reset to zero and remains refractory for n_r time steps. The modified equation is
where n_d and n_r stand for the delay time and the refractory period in units of the discretization, respectively. Increasing the temporal resolution corresponds to using larger values of these variables. We notice that the transformation from equation 2.1 to equation 4.1 implies a corresponding scaling of the decay constant, λ → λ^{1/n_r}, and of the temperature, T → T/√n_r. The behavior of the neural population in the presence of noise can be characterized by a Markov process with memory of order n_d,
where the probability P^t(z′) is replaced by P(z′, z″, …), the probability that a neuron had value z′ at time t, z″ at time t − 1, and so on. We should notice that this probability is not multiplicative, due to temporal correlations; for example, for the case n_d = 2,

Prob[z′(t) < 1, z″(t − 1) < 1] = 1 − m(t) − m(t − 1) ≠ [1 − m(t)][1 − m(t − 1)]

From equation 4.2, a map analogous to equation 3.6 can be obtained for λ = 0:
The solutions of this delay equation are more complex than the analogous solutions of equation 3.6. However, special solutions that also satisfy equation 3.6 can be obtained easily. Consider the following two cases:

- The activation of each subpopulation is spread homogeneously at the subrefractory time resolution, that is, m^t = m^{t−1} = ⋯ = m^{t−n_r}.
In this case, and for n_d = n_r, equation 4.3 transforms into

m^{t+1} = [(1 − n_r m^t)/2] {1 + tanh[(Jm^t + h − 1)/T]}   (4.4)

where the time bin equals the delay and the refractory period.

- Perfect synchronization of subpopulations, that is, m^t ≠ 0, m^{t−1} = ⋯ = m^{t−n_d} = 0, repeating cyclically modulo n_d. Such periodic solutions can be studied easily, since they also satisfy equation 3.6.
We have shown that for λ = 0, J = 1.5, T = 0.2 the population may converge to an attractor of high activity for high field values (e.g., h = 0.8, Fig. 1a, solid line), to the silent state for low field values (h = 0.4, Fig. 1c, solid line), or to one of two attractors for intermediate field values (h = 0.6, Fig. 1b, solid line), depending on the initial activation. When the high resolution dynamics are taken into account (equation 4.3), the following characteristics are revealed. For h = 0.8 both the homogeneous case (equation 4.4 and Fig. 1a, dashed line) and the synchronized case (Fig. 1a, solid line) have nontrivial solutions. For intermediate field values h = 0.6, only the synchronized solution is nontrivial; Fig. 1b (dashed lines) displays the map of equation 4.4, showing that for n_d = n_r ≥ 2, the only solution is the silent state. Numerical iteration of the map, equation 4.3, and simulations confirm that for a high external field either a homogeneous or a modulated solution is obtained (depending on initial conditions), while for intermediate fields (which are essential for memory preservation) only the synchronized solutions remain. Finally, we present simulation results that display the high resolution dynamics, equation 4.1, for weak decay (λ close to one). In Figure 6a, the system parameters were chosen identical to those used in Figure 2c (subject to the scaling of λ and T), and n_d = n_r = 4. We see that the population's activity manifests rapid oscillations (of period equal to the delay time), with amplitude m = 0.2, reflecting a segregation into five populations (as in Fig. 2c). Figure 6b displays a close-up of Figure 6a, showing that inside each such population there is a relatively high but variable synchronicity that leads to slower and irregular waves superimposed on the rapid oscillations (the periodicity of the signal in this synchronized state is imposed by the transmission delay).
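The field dependence described above for the homogeneous case can be checked by iterating the map of equation 4.4 directly. The following is a minimal sketch (ours, not the authors' code), using the quoted parameters J = 1.5, T = 0.2 and assuming n_r = 2 for illustration:

```python
import math

def iterate_map(m0, h, J=1.5, T=0.2, n_r=2, steps=500):
    """Iterate the homogeneous mean-activity map of equation 4.4:
    m_{t+1} = (1 - n_r m_t) * (1/2) [1 + tanh((J m_t + h - 1)/T)]."""
    m = m0
    traj = []
    for _ in range(steps):
        m = (1.0 - n_r * m) * 0.5 * (1.0 + math.tanh((J * m + h - 1.0) / T))
        traj.append(m)
    return traj

# High field: the activity stays well away from the silent state.
high = iterate_map(0.3, h=0.8)
# Low field: the activity decays to the (nearly) silent state.
low = iterate_map(0.3, h=0.4)
print(max(high[-50:]), max(low[-50:]))
```

For h = 0.8 the iterates remain at substantial activity (the map has no fixed point near zero), while for h = 0.4 they settle near the silent state, in agreement with the solid-line attractors described in the text.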
If the delay time is larger than the refractory period, a different behavior is obtained. In Figure 6c we illustrate the population dynamics for the same parameters as in Figure 6a, except that n_d = 5, while n_r = 4 as before. We observe that the activity oscillates with the period of the delay time, but the amplitude is increased to m = 0.5, implying segregation into two subpopulations. The reason for this change is that, unlike in the previous case (n_d ≤ n_r), no activation is lost; that is, cells can send activation to themselves, since after the delay they are no longer refractory.

Figure 6: Simulations with higher temporal resolution (equation 4.1), N = 1000 neurons. (a) Parameters as in Figure 2b, except n_d = n_r = 4; (b) close-up showing that the periodicity is determined by the transmission delay; (c) n_d = 5, n_r = 4; (d) parameters as in Figure 4a, n_d = n_r = 4, T = 0; (e) same with noise, T = 0.01.

In such a case, complete synchronization of the population is obtained for higher values of J, in which all neurons fire with the period of the delay. In Figure 6d, we display the high resolution simulation (n_d = n_r = 4),
corresponding to small J values leading to long cycles (parameters as in Fig. 4). In the absence of noise, a periodic solution (of periodicity 15 n_d) analogous to Figure 4a is obtained. Adding a noise factor equivalent to the one used in Figure 4b, we observe that the activity manifests irregular oscillations (Fig. 6e) whose maximal amplitude is m = 0.05, implying segregation into approximately 20 subpopulations. The synchronization of each subpopulation varies stochastically, and thus the total activity is characterized by aperiodic spindles of high-frequency oscillations. Such spindles are highly suggestive of experimental field potentials or EEG recordings. Since these are also considered to reflect global activity in some neural population, as is the case in our model, we think that our model provides a simple explanation of possible mechanisms underlying these phenomena.

5 Discussion
Most neural network models are built of simplified units that do not take into account inherent properties of real neurons, such as the refractory period and the integration time. Such properties are essential for any computational process dealing with the temporal structure of neural activity. For example, in the absence of an integration constant, small jitter in the transmission of signals would wash out synchronicity effects. Thus networks based on refractory integrate-and-fire neurons are among the simplest models that may explain how the brain processes temporally structured information. However, despite the widespread use of integrate-and-fire neurons as models for single cells, only recently have systematic studies of the population dynamics of such networks been carried out (Amit and Tsodyks 1992; Ahlstrom and Levinsen 1992; Kuramoto 1992; Van Vreeswijk and Abbott 1992). In the present work, we have studied the behavior of large, completely connected networks of refractory integrate-and-fire neurons in the presence of homogeneous input and noise. Two results seem particularly noteworthy. One is the finding that at the time resolution of the synaptic delay, the presence of noise will always lead to a segregation of the system into neuronal groups of equal size. Thus on time scales larger than the transmission delay, our results support an interpretation of steady-state firing rate as in Amit and Tsodyks (1992). However, at a finer time resolution, firing may be highly synchronized and temporally modulated with the periodicity of the delay time, as in the model proposed by Gerstner and Van Hemmen (1992). The second important result is the observation of hysteresis in the system. It is tempting to speculate that this is a mechanism employed for the neural implementation of short-term memory. The idea of memory being implemented in the brain by reverberating activity dates back at least to Caianiello and has received new vigor from recent observations showing
that the activation of short-term memory is correlated with increased neural activity in neocortex (Miyashita and Chang 1988; Fuster 1990). Our results indicate that self-excitatory networks of integrate-and-fire neurons are able to manifest a basic form of this behavior, especially when they operate in the synchronized mode. As we have shown, when the system is in its memory mode, characterized by two attractors, only the synchronized (and not the homogeneously spread-out) solution is nontrivial. The main reason for this is that synchronized activity is more effective in activating other cells than randomly spread activity. This may also shed light on the widely argued discussion about the role of cortical oscillations in information processing. According to our model, aperiodic oscillations and synchronicity are emergent properties of neural populations with recurrent excitation, and contribute to their memory function. The dynamic properties of such systems are different from those of memory systems based on McCulloch-Pitts neurons. For example, when transmission delays are equal to or smaller than the refractory period, a transient external input is more effective in inducing sustained activity in populations of integrate-and-fire neurons if the input is at an intermediate value rather than at a relatively high value. The significance of such dynamic characteristics for the actual physiological processes underlying short-term memory, and the use of integrate-and-fire neurons for modeling physiological time-dependent processes, should be the subject of further research.
Acknowledgments EN is supported by the Office of Naval Research and wishes to thank HGS and the Institute for Theoretical Physics of the University of Kiel for their hospitality. HGS thanks Christof Koch for his kind hospitality during his stay at Caltech. MU is supported by a Bantrell fellowship. This work was supported by NATO Grant 911034COP.
References

Ahlstrom, P., and Levinsen, M. T. 1992. Phase-locking structure of integrate-and-fire models with threshold modulation. Phys. Lett. A 128(3-4), 187-192.
Amit, D. J., and Tsodyks, M. V. 1992. Quantitative study of attractor neural network retrieving at low spike rates. Network 2(3), 259-273.
Fuster, J. M. 1990. Inferotemporal units in selective visual attention and short-term memory. J. Neurophysiol. 64, 681-697.
Gerstner, W., and Van Hemmen, J. L. 1992. Associative memory in a network of 'spiking' neurons. Network 3, 139-164.
Kuramoto, Y. 1992. Collective synchronization of pulse-coupled oscillators and excitable units. Physica D 50(1), 15-30.
Mirollo, R. E., and Strogatz, S. H. 1990. Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math. 50(6), 1645-1662.
Miyashita, Y., and Chang, H. S. 1988. Neural correlate of pictorial short-term memory in primate temporal cortex. Nature (London) 331, 68-70.
Van Vreeswijk, C., and Abbott, L. F. 1992. Self-sustained firing in populations of integrate and fire neurons. SIAM J. Appl. Math.

Received 14 September 1992; accepted 17 November 1992.
Communicated by Christof Koch
The Effects of Cell Duplication and Noise in a Pattern Generating Network
Catherine H. Ferrar, Thelma L. Williams
Department of Physiology, St. George's Hospital Medical School, University of London, London, U.K.
Graham Bowtell Department of Mathematics, The City University, London, U.K.
Stability against stochastic variation is an important property for biological systems. This paper investigates the robustness of the rhythmic activity produced by a model of the segmental rhythm generator for locomotion in the lamprey, by introducing stochastic properties into the network. In addition, since neuronal models for vertebrate systems often use a single neuron to represent a large class of cells, this paper explores one of the consequences of such reduction by investigating the effects of duplicating all the cells of the network on its stability against stochastic variation. We have found the basic model network to be very stable, and have found that this stability is increased by doubling the number of cells in the network.

1 Introduction

It is well known that rhythmic behavior in many animals is produced by the activity of autonomously oscillating groups of neurons, known as central pattern generators (CPGs), which require no sensory feedback to provide the basic timing for muscle activation (Delcomyn 1980). CPGs have been described in many invertebrate systems, typically containing a relatively small number of individually identifiable cells. In vertebrates, by contrast, the central nervous system generally consists of classes of cells with similar properties and connections, and in consequence, vertebrate CPG models often use single neurons to represent large classes of neurons. To gain insight into some of the consequences of such simplification, we have experimentally investigated the behavior of a model of the segmental CPG for locomotion in the lamprey. Although the detailed structure of the lamprey locomotor CPG is not known, a small network model consisting of neurons of identified classes known to be rhythmically active during fictive locomotion (Buchanan and Grillner 1987) has been shown to produce oscillations with phase relationships among the three cell types that are similar to those seen in the lamprey spinal cord in vitro (Grillner et al. 1988; Buchanan 1992; Wallen et al. 1993). In this paper we investigate the stability of this network by randomly varying the strengths of the interneuronal connections. In addition, we have investigated the effects on such stability of doubling the number of neurons in the rhythm generator.

Neural Computation 5, 587-596 (1993) © 1993 Massachusetts Institute of Technology

Figure 1: Proposed basic network for the lamprey central pattern generator for swimming (Buchanan and Grillner 1987), with driving cells added to each side. C, crossed caudal interneuron (CCIN); E, excitatory interneuron (EIN); I, lateral interneuron (LIN); D, driving cell. Filled circles indicate inhibition and open triangles excitation.

2 Network Model

The model depicted in Figure 1 is based on that of Buchanan and Grillner (1987), containing three types of interneuron (EIN, excitatory interneuron; CCIN, crossed caudal interneuron; LIN, lateral interneuron) forming a network that is symmetric about a center line corresponding to the left-right axis. The EINs provide excitation to motor neurons, and the CCINs
provide midcycle inhibition. In our simulations, tonic stimulation was provided to each side of the network by driver cells (D); increasing the activity of the driver cells increases the frequency of the oscillations. The initial conditions for the simulations had left/right symmetry, which leads to an unstable equilibrium if no stochastic noise is present. For this reason, a small left/right imbalance (1 in 10^9) was introduced into the driver cell connections in those simulations that had no noise. This allows the network to begin oscillating; after oscillation has been established, such an imbalance can be discarded with no effect on the ensuing output. For networks with even quite low levels of introduced noise, a simulation goes spontaneously into oscillatory behavior with no requirement for imbalance.

3 The Doubled Network
To investigate the effects of doubling the number of neurons in the network, simulations were also run on a network containing twice as many EINs, CCINs, and LINs. The pattern of synaptic connections was the same as in the single network: each CCIN made inhibitory synapses on all six contralateral cells (excluding the D cell), each EIN made excitatory synapses on the ipsilateral pairs of CCINs and LINs, and each LIN made inhibitory synapses only on the ipsilateral pair of CCINs. Since the total number of synapses on these cells was thus doubled, the synaptic strengths were halved to allow direct comparison between the two types of network. All cells in the doubled network received the same tonic stimuli from the driver cells as in the single network.

4 Simulation
The basic algorithm for the simulation was that proposed by Grossberg (1978) and embodied in a commercial software package by McClelland and Rumelhart (1988). Each neuron is assumed to have an activation level corresponding to its membrane potential, and synaptic links between neurons are such that one neuron can only affect another neuron if its activation level is above a certain threshold. Although action potentials are not modeled explicitly, the level of activation above the threshold can be thought of as representing the frequency of action potentials. The threshold has been set, without loss of generality, to zero. Mathematically we define

[a_i] = 0    if a_i ≤ 0
[a_i] = a_i  if a_i > 0

where a_i is the activation level of the ith neuron. The input to the ith postsynaptic neuron from the jth presynaptic neuron, which can be thought
of as representing a change in conductance, can be expressed as w_ij [a_j], where w_ij represents the strength of the synapse from the jth to the ith neuron. The discrete time course of the activation of the system is now computed from

Δa_i = (Max − a_i) Ex_i + (a_i − Min) Inh_i − Decay (a_i − Rest) + d_i (Max − a_i)

where Δa_i is the change in the activation over one time increment. Ex_i and Inh_i are the sums of all inputs to the ith cell from all presynaptic excitatory and inhibitory cells, respectively, within the network; these are given explicitly as

Ex_i = Σ_j w_ij [a_j],   w_ij ≥ 0
Inh_i = Σ_j w_ij [a_j],  w_ij < 0

Max and Min are the limits of depolarization and hyperpolarization, playing the role of a reversal potential for the excitatory and inhibitory synapses, respectively. Decay is a rate constant controlling the rate of return of the activation to the resting potential Rest, and d_i is the tonic stimulus received by the ith cell from its driving cell. This network produces oscillations over a large parameter space. For simplicity, in this investigation Max and Min were set equal to +1 and −1, respectively, Decay to 0.1, and Rest to zero. The standard synaptic strengths w_ij within the single network were set equal to 1 (positive for excitatory synapses, negative for inhibitory ones). The strengths of the synapses from the D cells to the CCINs were four times higher, at 0.12, than those to the EINs and LINs, at 0.03.

5 Numerical Investigation
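The activation update defined in the previous section can be sketched in a few lines. The following is a minimal illustration (ours, not the McClelland and Rumelhart package used by the authors); the two-cell reciprocal-inhibition fragment used to exercise it is a hypothetical circuit, not the full network of Figure 1, and serves only to show the update rule and the D-cell drive value of 0.12 quoted above:

```python
import numpy as np

def step(a, W, d, Max=1.0, Min=-1.0, Decay=0.1, Rest=0.0):
    """One discrete update of the activation dynamics.

    a : activation vector (membrane potentials)
    W : synaptic matrix, W[i, j] = strength of the synapse from j to i
    d : tonic drive from the driver cells
    """
    r = np.maximum(a, 0.0)              # [a_j]: output, zero below threshold
    Ex = np.where(W > 0, W, 0.0) @ r    # summed excitatory input Ex_i
    Inh = np.where(W < 0, W, 0.0) @ r   # summed inhibitory input Inh_i (<= 0)
    da = (Max - a) * Ex + (a - Min) * Inh - Decay * (a - Rest) + d * (Max - a)
    return a + da

# Hypothetical two-cell fragment: mutual inhibition, as between left and
# right CCINs, driven at the CCIN strength 0.12 quoted in the text.
W = np.array([[0.0, -1.0],
              [-1.0, 0.0]])
d = np.array([0.12, 0.12])
a = np.array([0.1, 0.0])                # small left/right imbalance
for _ in range(300):
    a = step(a, W, d)
print(a)
```

Note how the (Max − a) and (a − Min) factors keep the activations within the depolarization and hyperpolarization limits, playing the role of reversal potentials as described in the text.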
Multiple simulations were run for both the single and double networks, using two different types of variation about the set of standard synaptic strengths. The first was a random variation in the synaptic strengths from the driver cells to all the cells of the network. In this case, each was multiplied by a factor chosen beforehand from a random distribution, to give a range of synaptic strengths from 50 to 100% of the standard values; variations greater than 100% frequently resulted in loss of oscillatory behavior. The synaptic strengths then remained constant during an entire run, and this variation will be referred to as the long time scale type. These simulations were run using the commercially available software package of McClelland and Rumelhart (1988). For the second type of variation, we wrote new software with identical algorithms, except that any given synaptic strength could be selected in each time step from a
pseudorandom distribution. Since a time step represents about 3% of the cycle duration, these variations occur on a time scale closer to that of real synaptic noise. The variations were applied to all the synaptic strengths within the network. With this short time scale variation, it was possible to increase the range of the random distribution of synaptic strengths to 0-200% of the standard values without destroying the stability of the system. In each case, a simulation was run for at least 300 time steps (about 10 cycles of oscillatory activity) or until the system appeared to be in a steady state, before measurements were made. In all runs (except the two runs of the single network in which oscillatory behavior failed, see below), all cells had oscillating membrane potentials (activation values) with the same average cycle durations. For comparison of different runs, the following measurements were made: cycle duration (measured from the times at which the activation level of a particular EIN cell crossed threshold), burst proportion (the fraction of the cycle duration during which the activation of the EIN was above threshold), and left-right phase lag (phase delay between EIN cells on opposite sides of the midline). For the short time scale runs, mean values were calculated from measurements made in 20 consecutive cycles. For the long time scale variation, the means were calculated from twenty independent runs.
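The cycle measurements described above can be extracted from an activation trace by locating threshold crossings. The sketch below is a minimal illustration (ours, not the authors' analysis code), demonstrated on a synthetic sinusoidal trace with a 35-step cycle standing in for an EIN output:

```python
import numpy as np

def upward_crossings(x, thresh=0.0):
    """Indices where the trace crosses the threshold from below."""
    above = x > thresh
    return np.flatnonzero(~above[:-1] & above[1:]) + 1

def cycle_stats(x, thresh=0.0):
    """Mean cycle duration (in time steps) and burst proportion of a trace."""
    idx = upward_crossings(x, thresh)
    durations = np.diff(idx)                     # steps between crossings
    burst = np.mean(x[idx[0]:idx[-1]] > thresh)  # fraction above threshold
    return durations.mean(), burst

# Synthetic trace with a 35-step cycle, mimicking the model's EIN output.
t = np.arange(2000)
x = np.sin(2 * np.pi * (t + 0.3) / 35.0)
dur, bp = cycle_stats(x)
print(dur, bp)
```

The left-right phase lag would be obtained the same way, by comparing crossing times of EIN traces on opposite sides of the midline.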
6 Results

The time course of the activation variables of the left and right EIN cells in one run of the single network without noise is shown in Figure 2A. The oscillator exhibits strict left-right alternation with a burst proportion of 32%. The cycle duration is approximately 35 time steps. (If a time step is taken arbitrarily to represent 10 msec, the frequency of oscillation is near 3 Hz, a reasonable swimming frequency for a lamprey.) The double network without noise gave identical results. Examples of the effects of noise with a short time scale variation are shown in Figure 2B for the single network and in Figure 2C for the double network. For both these cases the cycle duration, burst proportion, and phase difference between left and right cells are no longer constant. Plots of cycle duration against consecutive run number (long time scale variation) or cycle number (short time scale variation) are shown in Figure 3 for both the single and double networks. It can be seen that the system is less affected by noise on the short time scale; the effects tend to average out over a cycle. For both time scales, doubling the network made the system less variable under the influence of noise. More precisely, Figure 4A shows the mean and standard deviation of these four data sets, from which it is seen that for both short time scale and long time scale variation, doubling the network approximately halves the standard deviations. F tests indicate that the variances of cycle durations
Figure 2: Outputs from a left (solid line) and a right (dotted line) EIN, plotted against time. (A) Standard single oscillator with no noise. (B) Single oscillator with 100% noise on the short time scale. (C) Double oscillator with 100% noise on the short time scale. The broken horizontal line represents the threshold level.
are significantly different in the single and double networks, with p < 0.01 for the long time scale variation and p < 0.05 for the short time scale variation. The means and standard deviations for the left-right phase difference are shown in Figure 4B; the mean was almost equal for each of the four cases. Statistically, it was found that doubling the network makes a significant change in the variability only when considering the long time scale cases. Figure 4C shows that doubling the network significantly reduces the variation of the burst proportion for long time scale variation, but not for short time scale variation.

Figure 3: (A) Cycle duration plotted against run number for a single network with long time scale variation. (B) As in (A) but for the double network. (C) Cycle duration plotted against cycle number for the single network with short time scale variation. (D) As in (C) but for the double network. The solid horizontal line indicates the mean cycle duration. One computational time step is taken arbitrarily as representing 10 msec.

Figure 4: Means and standard deviations of (A) cycle durations in the EINs (one time step represents 10 msec), (B) phase differences between left and right EINs, and (C) burst proportions of EINs. Symbols denote: mean of the single network, variation over the short time scale; mean of the single network, variation over the long time scale; mean of the double network, short time scale; mean of the double network, long time scale.

The standard deviations shown in Figure 4 for the single network with the long time scale (open squares) underestimate the variability in the following sense: in 2 of 22 runs with this configuration, the behavior of the network was oscillatory for only a few cycles, after which a nonperiodic solution was reached, in which some cells were silent while others were tonically active. In these cases, in other words, the set of synaptic strengths was not within the parameter space in which oscillatory behavior is supported. No measurements were made on these runs. To test whether further duplication of the network would reduce variability, the standard deviation of cycle duration was calculated for 4-, 8-, and 16-unit networks (Fig. 5). It can be seen that the variability continues to decrease.

7 Discussion
To appreciate the extent of the simplification made in this model of the lamprey CPG for locomotion, it should be noted that in each segment of the lamprey spinal cord there are approximately 60 CCINs (Buchanan 1982) and at least 40 EINs (Buchanan et al. 1989). On the other hand, only 50-150 LINs are estimated to occur in the entire spinal cord (approximately 100 segments), and these are restricted to the rostral half of the cord. This allows for fewer than 2 LINs per segment, but the role of LINs
Figure 5: Effects of further network duplication on variability. Ordinate: standard deviation of the cycle duration, expressed as a fraction of the mean. Abscissa: number of replicates (1, 2, 4, 8, 16).
in this circuit may actually be filled by the small, numerous inhibitory interneurons reported by Buchanan and Grillner (1988) to have a strong influence on the locomotor activity produced by the spinal cord in vitro. In spite of such uncertainty, the robustness of this oscillatory network in the face of even quite large random variations in synaptic current indicates a strongly attracting limit cycle, making the circuit a worthy candidate for the role of unit oscillator in the lamprey CPG. The finding that the behavior of the model network became even more stable when the number of neurons in each class was increased confirmed our intuitive prediction that it would benefit a network to have many neurons of the same class, since the random variation that occurs in a biological system would then be less disturbing. From a modeling point of view, it would appear that representing a class of cells by a single cell may not affect the behavior of the network, as long as the reduced model is stable enough.
Acknowledgments This work was supported by the SERC with additional funding (for CHF) from Glaxo and from the William Rushton Fund of the Physiological Society. We are grateful to Jim Buchanan for introducing us to the McClelland and Rumelhart software and for generously sending us his unpublished results.
References

Buchanan, J. T. 1982. Identification of interneurons with contralateral caudal axons in the lamprey spinal cord: Synaptic interactions and morphology. J. Neurophysiol. 47, 961-975.
Buchanan, J. T. 1992. Neural network simulations of coupled locomotor oscillators in the lamprey spinal cord. Biol. Cybernetics 66, 367-374.
Buchanan, J. T., and Grillner, S. 1987. Newly identified "glutamate interneurons" and their role in locomotion in the lamprey spinal cord. Science 236, 312-314.
Buchanan, J. T., and Grillner, S. 1988. A new class of small inhibitory interneurons in the lamprey spinal cord. Brain Res. 438, 404-407.
Buchanan, J. T., Grillner, S., Cullheim, S., and Risling, M. 1989. Identification of excitatory interneurons contributing to generation of locomotion in lamprey: Structure, pharmacology and function. J. Neurophysiol. 62(1), 59-69.
Delcomyn, F. 1980. Neural basis of rhythmic behaviour in animals. Science 210, 492-498.
Grillner, S., Buchanan, J. T., and Lansner, A. 1988. Simulation of the segmental burst generating network for locomotion in the lamprey. Neurosci. Lett. 89, 31-35.
Grossberg, S. 1978. A theory of visual coding, memory, and development. In Formal Theories of Visual Perception, E. L. J. Leeuwenberg and H. F. J. M. Buffart, eds., pp. 7-26. Wiley, New York.
McClelland, J. L., and Rumelhart, D. E. 1988. Explorations in Parallel Distributed Processing: A Handbook of Models, Programs and Exercises, 1st ed. MIT Press, Cambridge, MA.
Rovainen, C. M. 1979. Neurobiology of lampreys. Physiol. Rev. 59, 1007-1077.
Wallen, P., Ekeberg, O., Lansner, A., Brodin, L., Traven, H., and Grillner, S. 1993. A computer based model for realistic simulations of neural networks II: The segmental network generating locomotor rhythmicity in the lamprey. J. Neurophysiol. 68, 1939-1950.

Received 15 April 1992; accepted 6 January 1993.
Communicated by Richard Andersen
Emergence of Position-Independent Detectors of Sense of Rotation and Dilation with Hebbian Learning: An Analysis
Kechen Zhang, Martin I. Sereno, Margaret E. Sereno*
Department of Cognitive Science, University of California, San Diego,
La Jolla, CA 92093-0515 USA

We previously demonstrated that it is possible to learn position-independent responses to rotation and dilation by filtering rotations and dilations with different centers through an input layer with MT-like speed and direction tuning curves and connecting them to an MST-like layer with simple Hebbian synapses (Sereno and Sereno 1991). By analyzing an idealized version of the network with broader, sinusoidal direction tuning and linear speed tuning, we show analytically that a Hebb rule trained with arbitrary rotation, dilation/contraction, and translation velocity fields yields units with weight fields that are a rotation plus a dilation or contraction field, and whose responses to a rotating or dilating/contracting disk are exactly position independent. Differences between the performance of this idealized model and our original model (and real MST neurons) are discussed.

1 Introduction

A major stream of motion information processing in the primate visual system goes from layer 4B in primary visual cortex (V1) to the middle temporal area (MT) and then to the medial superior temporal area (MST) (for reviews see Sereno and Allman 1991; Felleman and Van Essen 1991). Most neurons in area MT have moderate-sized receptive fields, and a subset is tuned to the local pattern velocity (Movshon et al. 1985). Neurons in the dorsal part of MST, by contrast, have much larger receptive fields, and some are selective to higher order motion features, for example, rotation (either clockwise or counterclockwise, but not both), and dilation or contraction (but not both) in the frontoparallel plane (Saito et al. 1986; Sakata et al. 1986; Tanaka and Saito 1989; Tanaka et al. 1989; Duffy and Wurtz 1991a,b).
Detecting rotation, dilation, and contraction provides useful information about an animal's motion relative to the environment or about the intrinsic motion of an object (Koenderink and

*Present address: Department of Psychology, University of Oregon, Eugene, OR 97403.
Neural Computation 5, 597-612 (1993) © 1993 Massachusetts Institute of Technology
van Doorn 1975, 1976; Longuet-Higgins and Prazdny 1980; Koenderink 1986). An interesting property is that some dorsal MST neurons give nearly identical responses to a rotation, or dilation or contraction, no matter where the center of the velocity flow is located. We sought to find a neural mechanism for this position invariance.

To be selective to a rotation or dilation/contraction with a fixed center, the receptive field of an MST neuron need only be composed of the MT neurons whose preferred directions are arranged circularly or radially around that center (Saito et al. 1986). At first glance, this simple mechanism would not seem to be able to support an invariant response when the position of the center changes (Saito et al. 1986; Tanaka et al. 1989; Duffy and Wurtz 1991b). Two previous proposals for a position-independent mechanism assume a homogeneous organization for an MST receptive field to ensure that all its subfields have identical structure and function. In one model, the local rotation and dilation of the velocity field is first derived and then summed across space to get invariant responses (Duffy and Wurtz 1991b). This algorithm requires that MT neurons be selective to local rotation and dilation/contraction, which is generally not the case (Tanaka et al. 1986). Another model makes use of partially overlapping compartments in an MST receptive field (Saito et al. 1986). But this model needs a special surround effect in MT neurons to prevent many compartments from being activated simultaneously, the exact mechanism of which awaits further experimental proof.

A simpler yet counterintuitive solution was discovered in a computer simulation experiment using a feedforward network and unsupervised learning (Sereno and Sereno 1990, 1991). That work was based on a previous study in which Hebbian learning was used to find a solution to the aperture problem in a two-layer feedforward network corresponding to the connections from V1 → MT (Sereno 1989).
When a similar network (with a larger interlayer divergence) representing MT → MST connections is trained with rotation, dilation, and contraction using a Hebb rule and input-layer units with MT-like tuning curves, MST-like units with position-independent responses emerge. Surprisingly, such rotation or dilation/contraction detectors turned out to have inhomogeneous receptive fields with a circular, spiral, or radial arrangement of local direction selectivity, just as in the simple mechanism mentioned before.

In this letter we analyze a modified version of the original model in Sereno and Sereno (1991). The input layer of the modified model has broader (cosine) tuning curves than in the original model (and broader than those of real MT neurons), but it allows us to derive explicit expressions for the course of learning empirically observed in the original model. The modified model gives rise to MST-like units that linearly decompose the flow field into flow field components; for example, a clockwise-rotation-preferring unit will respond as well to the rotation in a clockwise spiral as to a pure clockwise rotation, ignoring any added dilation/contraction. By contrast, in our original model, the sharper tuning curves for the MT-like units result in MST-like units whose response falls off as other optic flow components are added. This smooth fall-off has also been observed with real MST neurons (Graziano et al. 1990; Orban et al. 1992). It is important to note, however, that the basic mechanism of position-invariant response to flow field stimuli (a position-variant direction-tuning template) is identical in both the idealized model with cosine tuning curves and the original model with narrower tuning curves. Linear decomposition might yet be found in areas beyond MST. It would be useful for filtering out certain movement components (e.g., translation) while exactly signaling the magnitude of others of interest (e.g., dilation). On the other hand, tighter input-layer tuning curves allow individual output-layer units to code more information about a flow field (see Discussion).
2 A Mechanism for Position Independence

First, we show the basic principles for the position-independent responses, as initially revealed by computer simulation (and recently independently derived in similar form by Poggio et al. 1990, 1991). Let v = v(r) be the velocity field on the image plane (frontoparallel plane), where the vector r = xi + yj denotes the position, with i and j being the unit vectors of the x and y axes. Consider MT-like units that are sensitive to the local stimulus velocities v. Each MT-like unit has a preferred direction. Given v as the stimulus velocity at a fixed position, the response or activation a of the MT-like unit at that position is assumed to be proportional to the velocity component in the preferred direction, or
a = cv cos(θ − φ) = c dφ · v
(2.1)
where v = |v| is the stimulus speed, c is a constant coefficient representing the slope of the (linear) speed tuning, θ is the direction angle of v, and dφ is the unit vector for the preferred direction angle φ. In other words, the unit has a linear response to the speed v and a sinusoidal direction tuning curve with the maximum at the preferred direction φ (Fig. 1). The artificial MT-like units resemble the real neurons in area MT of the monkey in certain respects (Rodman and Albright 1987). For most MT neurons, speed does not alter the shape of direction tuning curves, which implies a multiplicative interaction of the speed tuning and the direction tuning, as used in expression 2.1. The linear speed tuning is a reasonable approximation for small speeds, although the response is sometimes reduced when the speed exceeds an optimum value. The sinusoidal direction tuning is broader than that of a typical real MT neuron (Maunsell and Van Essen 1983). Also, the responses of real MT neurons to the antipreferred direction are usually smaller. We retain expression 2.1 for its simplicity and ease of analysis.
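As a concrete illustration of expression 2.1, the following sketch (Python with NumPy; all parameter values are illustrative, not taken from the paper) computes the response of an idealized MT-like unit and checks that it is maximal at the preferred direction and linear in speed:

```python
import numpy as np

def mt_response(v_vec, phi, c=1.0):
    """Response of an idealized MT-like unit (expression 2.1):
    a = c * v * cos(theta - phi) = c * d_phi . v."""
    d_phi = np.array([np.cos(phi), np.sin(phi)])  # preferred-direction unit vector
    return c * d_phi @ v_vec

phi = np.pi / 3                                   # preferred direction (illustrative)
thetas = np.linspace(0, 2 * np.pi, 360, endpoint=False)

for v in (0.5, 1.0, 2.0):                         # a family of speeds, as in Fig. 1
    responses = [mt_response(v * np.array([np.cos(t), np.sin(t)]), phi)
                 for t in thetas]
    # maximal response occurs at the preferred direction and equals c*v there
    assert abs(thetas[np.argmax(responses)] - phi) < 0.02
    assert abs(max(responses) - v) < 1e-6
```

Plotting `responses` against `thetas` for the three speeds reproduces the family of cosine tuning curves sketched in Figure 1.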
Figure 1: A family of direction tuning curves for different speeds.

Now consider an MST-like unit that receives inputs from many MT-like units. It is convenient to define the weight vector field of an MST-like unit. For any position on the image plane, we define the weight vector w at that point as
w = c w dφ
(2.2)
where dφ is the preferred direction of the MT-like unit at that position, w is the scalar weight for its connection to the MST-like unit, and c is the same constant coefficient as in 2.1. The total input I to the MST-like unit is assumed to be the weighted sum of the inputs from all the MT-like units within the receptive field of the MST-like unit:
I = Σᵣ wa = Σᵣ w · v

(2.3)
where the simple relation (see equations 2.1 and 2.2) wa = w c dφ · v = w · v has been used. The output of the MST-like unit is simply

O = σ(I)

where σ( ) is a sigmoid function. If the weight vector field of an MST-like unit is itself a rotational field, namely,

w = Ω × r = −Ωy i + Ωx j
(2.4)
where the vector Ω = Ωk can be regarded as the "angular velocity" for the weight vector field w, with k being the unit vector of the z axis (perpendicular to the image plane), then we can prove that the MST-like unit's response O to a rotating disk of angular velocity ω = ωk depends only on the angular speed ω of the stimulus but not on the location of the stimulus disk.
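Before the proof, this claim can be checked numerically. The sketch below (Python with NumPy; the grid spacing, speeds, and disk centers are arbitrary illustrative choices) sums w · v over a rotating disk for two different disk centers and compares both sums with the analytic total input πωΩρ⁴/2 derived in this section:

```python
import numpy as np

def disk_input(center, omega=0.7, Omega=1.3, rho=1.0, h=0.01):
    """Total input I = sum over the disk of w . v, for a weight field
    w = Omega x r (eq. 2.4) and a disk of radius rho rotating about
    `center` with angular speed omega."""
    xs = np.arange(-3.0, 3.0, h)
    X, Y = np.meshgrid(xs, xs)
    wx, wy = -Omega * Y, Omega * X              # rotational weight field
    dx, dy = X - center[0], Y - center[1]
    inside = dx**2 + dy**2 <= rho**2            # stimulus disk mask
    vx, vy = -omega * dy, omega * dx            # rotating stimulus velocity field
    # unit density of units -> weight each grid cell by its area h*h
    return np.sum((wx * vx + wy * vy)[inside]) * h * h

I_centered = disk_input((0.0, 0.0))
I_shifted = disk_input((1.2, -0.8))
I_analytic = np.pi * 0.7 * 1.3 * 1.0**4 / 2.0   # pi * omega * Omega * rho^4 / 2

# position independence: the input depends on omega, Omega, rho, not the center
assert abs(I_shifted - I_centered) < 0.03 * abs(I_centered)
assert abs(I_centered - I_analytic) < 0.03 * abs(I_analytic)
```

The small tolerances absorb only the discretization error of the grid; the continuum result is exact, as the Stokes-theorem argument below shows.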
Figure 2: Response elicited by a rotating (a) or dilating (b) ring in a receptive field with circular or radial distribution of direction selectivity is independent of the position of the ring (see text).
To show this, decompose the stimulus disk into many concentric rings and calculate the response elicited by a single rotating ring of radius R and width ΔR (Fig. 2a). Let v be the velocity field of the stimulus ring, and ΔI = Σ w · v be the increment to the total input to the MST-like unit contributed by all the MT-like units within the area covered by the ring. We treat the weight vector field w = w(r) as depending continuously on the position r. The number of units per unit area on the image plane is assumed to be a constant, and is taken as unity for simplicity. Replacing the sum by an integration along the ring, we get

ΔI = ΔR ∮ w · v dl = v ΔR ∮ w · dl = v ΔR ∫_S (∇ × w) · dS

(2.5)
where dl = (v/v) dl, with v/v being the unit vector in the circular direction, and S is the area enclosed by the ring. The last equality is a direct application of Stokes' theorem, where dS is the area element. Since the weight vector field 2.4 has a constant curl ∇ × w = (∂wy/∂x − ∂wx/∂y)k = 2Ωk, the last integral in equation 2.5 is equal to ∫_S 2Ω dS = 2πΩR². Since the speed v of the stimulus ring is proportional to its radius (v = ωR), we finally obtain
ΔI = 2πωΩR³ ΔR
which is independent of the position of the stimulus ring. The total response to the disk of radius ρ is therefore
O = σ(I) = σ(πωΩρ⁴/2)
which is also position independent. If the angular speed ω changes sign, that is, if the disk rotates in the opposite direction, the total input I also changes sign. To get position-independent responses to dilation or contraction, just let the weight vector field be
w = Ar = Ax i + Ay j
(2.6)
which is itself a "dilation" when the constant A > 0 and a "contraction" when A < 0. This vector field has a constant divergence ∇ · w = ∂wx/∂x + ∂wy/∂y = 2A (cf. the expression for constant curl above, except that div is a scalar). The response elicited by a dilating ring (Fig. 2b) is independent of the position of its center. The proof is similar, but Gauss' theorem is used to evaluate the integral:
∮ w · v dl = v ∮ w · n dl = v ∫_S ∇ · w dS = 2πλAR³
where n = v/v is the unit vector in the radial direction of the ring, and v = λR, where λ specifies the rate of dilation (as ω specifies the rotation speed). It follows that the response to a dilating disk is also position independent. In the case that the stimulus is a contraction, the input just changes sign.

This result can be intuitively appreciated by considering Figure 3. In (a), the stimulus is centered. Since the local stimulus direction v(r) always agrees with the weighted local preferred direction w(r) in the receptive field, the dot product between each pair is positive, though small. In (b), with the stimulus center situated to the right of the receptive field center, local direction selectivity and local stimulus direction clash near the center of the receptive field; the dot products there are actually negative. But the negative terms are compensated, exactly, as we have seen, by the larger positive dot products in the periphery of the receptive field.

3 Development of the Weight Vector Field under a Hebb Rule
Now we consider the general manner in which the weight vector field changes during Hebbian learning. At each position on the image plane we use a set of MT-like units with different preferred directions. Let wφ(r) denote the weight of the MT-like unit at position r with preferred direction φ. As before, the response or activation of the MT-like unit to the velocity field v(r) is
aφ(r) = cφ(r) dφ · v(r)

(3.1)
[Figure 3 legend: v(r), local stimulus direction; w(r), local weighted direction selectivity in the MST receptive field.]
Figure 3: An intuitive interpretation of the mechanism for position independence. Negative dot products near the center of the receptive field in (b) are compensated by larger ones peripherally to give the same sum as in (a).
Figure 4: Each unit in the MST layer receives inputs from MT-like units at different positions and with different preferred directions (indicated by arrows).

and the weight vector is defined as wφ(r) = cφ(r)wφ(r)dφ, with dφ again being the unit vector for the preferred direction of an MT-like unit. The total input I to the MST-like unit is the weighted sum of the responses from all MT-like units within the receptive field (Fig. 4), i.e., summing
over different positions in the receptive field as well as different preferred directions:

I = Σᵣ Σφ wφ(r) aφ(r) = Σᵣ Σφ wφ(r) · v(r)

(3.2)
where the identity wφ(r)aφ(r) = wφ(r)cφ(r)dφ · v(r) = wφ(r) · v(r) is used. We can treat the system as if there were only one MT-like unit, specified by the weight vector at each position (call this the equivalent weight vector):

w(r) := Σφ cφ(r) wφ(r) dφ
so we can write equation 3.2 as

I = Σᵣ w(r) · v(r)

(3.3)
which is exactly the same as expression 2.3 in the previous section. As before, the output of the MST-like unit is

O = σ(I)
Suppose the increment of the weight in each training step follows a simple Hebb rule

Δwφ(r) = ε aφ(r) O

(3.4)

where ε is a learning rate constant.
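As a concrete sketch of the learning dynamics in equations 3.1-3.4 (Python with NumPy; the grid, number of preferred directions, learning rate, and the tanh output nonlinearity are illustrative assumptions, not the paper's settings), the loop below trains on a fixed rotation stimulus from small random weights. The equivalent weight vector field w(r) = Σφ cφ(r)wφ(r)dφ ends up aligned, up to sign, with the training field, as the analysis in this section goes on to show:

```python
import numpy as np

rng = np.random.default_rng(1)

n_dirs = 8                                           # MT-like preferred directions per position
phis = 2 * np.pi * np.arange(n_dirs) / n_dirs
d = np.stack([np.cos(phis), np.sin(phis)], axis=1)   # (n_dirs, 2) unit vectors d_phi

xs = np.linspace(-1.0, 1.0, 21)
X, Y = np.meshgrid(xs, xs)
pos = np.stack([X.ravel(), Y.ravel()], axis=1)       # (P, 2) positions in the receptive field
v = np.stack([-pos[:, 1], pos[:, 0]], axis=1)        # training stimulus: a rotation field

c = 1.0                                              # speed-tuning slope (constant here)
eps = 1e-3                                           # learning rate (illustrative)
w = rng.normal(scale=1e-3, size=(pos.shape[0], n_dirs))  # small random initial weights

for _ in range(200):
    a = c * (v @ d.T)            # MT responses a_phi(r), eq. 3.1: shape (P, n_dirs)
    O = np.tanh(np.sum(w * a))   # MST output O = sigma(I), eqs. 3.2-3.3
    w += eps * a * O             # Hebb rule, eq. 3.4

w_eq = c * (w @ d)                                   # equivalent weight field: (P, 2)
cos_sim = np.sum(w_eq * v) / np.sqrt(np.sum(w_eq**2) * np.sum(v**2))
# the learned equivalent field is (anti)parallel to the training rotation field
assert abs(cos_sim) > 0.99
```

The sign of the alignment depends on the sign of the initial response, which is the seed of the continuous spectrum of selectivity discussed in Section 4.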
In the present model no explicit distinction has been made between excitatory and inhibitory synaptic connections, and the weights are allowed to change sign. Since w(r) = Σφ cφ(r)wφ(r)dφ, the corresponding increment of the equivalent weight vector field is

Δw(r) = Σφ cφ(r) Δwφ(r) dφ = Σφ [ε cφ²(r) dφ · v(r) O] dφ
where the last equality is obtained by substituting equation 3.1 into 3.4. The coefficient cφ(r) is assumed to be a random variable with uniform distribution across the angle φ. Let c̄² be the average of cφ²(r). It is assumed to be a constant across the image plane. As an approximation for a large number of units, we have

Δw(r) = c̄² Σφ [dφ · εOv(r)] dφ

(3.5)
To simplify this expression, first note that if there are n (≥ 3) unit vectors dφ distributed evenly around the unit circle, then

Σφ (dφ · V) dφ = (n/2) V

(3.6)
holds for all vectors V. For a proof, write each vector as a complex number, namely, V = ρe^{iθ} and dφ = e^{iφ}, where θ is the direction angle of V and ρ = |V| is the radius. Because dφ · V = ρ cos(θ − φ) = (ρ/2)(e^{i(θ−φ)} + e^{−i(θ−φ)}) is a real number, we have

Σφ (dφ · V) dφ = (ρ/2) Σφ (e^{i(θ−φ)} + e^{−i(θ−φ)}) e^{iφ} = (n/2) ρe^{iθ} + (ρ/2) e^{−iθ} Σφ e^{2iφ}

Since φ is evenly distributed around the circle, Σφ e^{2iφ} = 0. This proves equation 3.6. Assuming that the preferred directions of the MT-like units at each spatial position are evenly distributed, we can employ formula 3.6 by identifying V with εOv(r), so that equation 3.5 can be rewritten as

Δw(r) = (1/2) n ε c̄² O v(r)

(3.7)

where n is the number of MT-like units at each position. This increment is caused by a single training step with the velocity field v(r). After training with a sequence of velocity fields, the equivalent weight vector field adds up to
w(r) = w₀(r) + (1/2) n ε c̄² Σₜ Oₜ vₜ(r)

(3.8)
where t (= 0, 1, 2, ...) stands for all time steps in the training and w₀(r) is the initial weight vector. In conclusion, the final equivalent weight vector field is just proportional to the sum of the training velocity fields weighted by the corresponding responses of the MST-like unit.

4 Training with Translation, Rotation, Dilation, and Contraction
We are now ready to consider training with translation, rotation, dilation, and contraction velocity fields. To begin with, suppose that for a single training step the velocity field is a rotation centered at r_c,

v(r) = ω × (r − r_c)
(4.1)
and in different steps both the angular velocity ω and the center r_c vary randomly. Substituting equation 4.1 into 3.8 and ignoring the initial weight vector for its smallness, we obtain the final weight vector field

w(r) = η Σₜ Oₜ vₜ(r)

where η := n ε c̄²/2 is a constant. This can be identified with the rotational field
w(r) = Ω × (r − r₀)
(4.2)
Figure 5: The final weight vector field is generally composed of a rotation field (a) and a dilation or contraction field (b). The result (c) is a spiral field.

where the weight field "angular velocity" Ω and the weight field center r₀ are defined by Ω := η Σₜ Oₜωₜ and Ω × r₀ := η Σₜ (Oₜωₜ × r_{c,t}). The latter equation has a unique solution for r₀ as long as Ω ≠ 0. In the special case Ω = 0, w(r) is a constant vector field (translation). Similarly, training with a dilation or contraction

v(r) = λ(r − r_c)
with rate λ and center r_c varying in time will lead to the final weight vector field
w(r) = A(r − r₀)
(4.3)
where A := η Σₜ Oₜλₜ and A r₀ := η Σₜ Oₜλₜ r_{c,t}. This is either a dilation (A > 0) or a contraction (A < 0). In the special case A = 0, w(r) is a constant (translation). Note that expressions 4.2 and 4.3 are just what are required for position-independent responses (cf. equations 2.4 and 2.6). It should be realized that the center r₀ does not affect the curl and divergence of a vector field, and thus does not affect our previous conclusions. For training with a mixture of translations, rotations, dilations, and contractions, it is readily shown by a similar argument that the final weight vector field takes the form
w(r) = Ω × (r − a) + A(r − b) + c

It can always be written equivalently as

w(r) = Ω × (r − r₀) + A(r − r₀)
which is a spiral centered at r₀ (Fig. 5). An MST-like unit with a spiral weight vector field has position-independent responses to a particular sense of rotation as well as to either a dilation or a contraction.

Even if the training velocity fields have a zero average (for example, if clockwise and counterclockwise rotations have an equal chance of appearing), the weight vector field is still expected to grow with time. We consider the simple case where all rotations and dilations/contractions are centered at the same point, so that the development of the two corresponding components is strictly independent. Consider the initial stage of development for training with, say, rotation fields alone. We need consider only the linear range of the sigmoid function σ, and for simplicity we assume O = I. According to equations 3.7 and 3.3, at time step t + 1,

w_{t+1} = wₜ + η Oₜ vₜ

where the subscripts refer to time. Thus

O_{t+1} = Σᵣ w_{t+1} · v_{t+1}

It can be expressed as

O_{t+1} = A Ωₜ ω_{t+1} + η (A Ωₜ ωₜ)(A ωₜ ω_{t+1})

(4.4)
where Ωₜ and ωₜ are the angular speeds for the vector fields wₜ and vₜ at time t, respectively, and A is a constant depending on the size and shape of the receptive field as well as the position of the rotation center. Imagine an ensemble of parallel training sessions starting from different initial weights and using different rotation sequences of random angular speeds, which are independent of each other while having identical statistics. We take the ensemble average on both sides of equation 4.4 to get ⟨O_{t+1}⟩ = A⟨Ωₜ⟩⟨ω⟩ + ηA²⟨Ωₜ⟩⟨ω²⟩⟨ω⟩, where the subscript for the angular speed ω is dropped because the statistics of ω do not change over time. If ⟨ω⟩ = 0, then ⟨O_{t+1}⟩ = 0 for all t. However, taking the ensemble average after squaring equation 4.4 and using ⟨Oₜ²⟩ = A²⟨Ωₜ²⟩⟨ω²⟩, we can obtain

⟨O_{t+1}²⟩ = (1 + 2ε̃ + αε̃²) ⟨Oₜ²⟩

(4.5)

where

ε̃ := ηA⟨ω²⟩, α := ⟨ω⁴⟩/⟨ω²⟩²

are constants. When ω is drawn from a gaussian distribution of zero
mean, for instance, α = 3. Applying equation 4.5 iteratively yields ⟨O_{t+1}²⟩ = e^{t/τ} ⟨O₁²⟩, where the constant

τ := 1 / ln(1 + 2ε̃ + αε̃²)
Because O₁ = Σᵣ w₁ · v₁ and w₁ = w₀ + ηO₀v₀ ≈ ηO₀v₀ = η(Σᵣ w₀ · v₀)v₀, by similar arguments as above we can get ⟨O₁²⟩ = αε̃² ⟨O₀²⟩. Hence

⟨Oₜ²⟩ = αε̃² e^{(t−1)/τ} ⟨O₀²⟩

(4.6)
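The growth law can be checked by direct ensemble simulation (Python with NumPy; the value of the combined constant ηA and the ensemble size are arbitrary illustrative choices). Writing Oₜ = AΩₜωₜ as above, the update w_{t+1} = wₜ + ηOₜvₜ reduces to the scalar recursion Ω_{t+1} = Ωₜ(1 + ηAωₜ²), so the mean squared amplitude should grow by the factor 1 + 2ε̃ + αε̃² per step, with α = 3 for gaussian ω:

```python
import numpy as np

rng = np.random.default_rng(0)
eta_A = 0.1                      # combined constant eta*A (illustrative)
sessions, steps = 200_000, 20    # ensemble of independent training sessions

omega = rng.standard_normal((sessions, steps))      # angular speeds, <omega^2> = 1
# Omega_T^2 / Omega_0^2 per session, from Omega_{t+1} = Omega_t (1 + eta*A*omega_t^2)
growth = np.prod((1.0 + eta_A * omega**2) ** 2, axis=1)

eps_t = eta_A * 1.0                                 # eps~ = eta*A*<omega^2>
predicted = (1.0 + 2.0 * eps_t + 3.0 * eps_t**2) ** steps   # alpha = 3 (gaussian)
measured = growth.mean()
# ensemble-averaged squared amplitude grows exponentially, as eq. 4.5 predicts
assert abs(measured / predicted - 1.0) < 0.05
```

With 200,000 sessions the sampling error of the ensemble mean is well below the 5% tolerance used here.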
Consequently, when many MST-like units develop in parallel starting from random initial weights, the responses (either positive or negative) to rotation and to dilation/contraction are expected to grow exponentially in the initial stage of development. The variety of the initial responses leads to a continuous spectrum of selectivity to rotation and dilation/contraction, which is what has actually been found in the neurophysiological experiments (Duffy and Wurtz 1991a; Andersen et al. 1991).

5 Discussion
The model provides a unified, albeit simplified, account of several essential properties of MST neurons and how they might develop. These properties include selectivity to rotation, dilation, and contraction, the position independence of the responses (Saito et al. 1986; Tanaka and Saito 1989; Tanaka et al. 1989; Duffy and Wurtz 1991a,b), the selectivity to spiral velocity fields (Graziano et al. 1990; Andersen et al. 1991), and the continuous spectrum of selectivity (Duffy and Wurtz 1991a; Andersen et al. 1991). The model's response saturates at higher speeds (as a result of the sigmoid function), as does the response of real neurons (Orban et al. 1992). In addition to rotation and dilation/contraction, shear also naturally arises in the optic flow (Koenderink 1986). Since a linear combination of shear fields is still a shear, according to equation 3.8 the weight vector field itself will also have a shear component. Consistent with the model, neurons selective to shear components were also found in cortical areas including MST (Lagae et al. 1991).

This model differs somewhat from the original model in Sereno and Sereno (1991) and from real MST neurons in that it "linearly decomposes" the velocity field; that is, an MST-like unit will respond exclusively to, say, the rotational component of a flow field, regardless of the magnitude of the radial component. Since a cosine tuning curve means that the input unit sees exactly the vector component of the local stimulus movement in the preferred (here rotational) direction, it leads to linear decomposition. With narrower tuning curves, the response of individual MST-like units provides more information about the exact composition of the flow field, for example, the extent to which it approximates a pure
rotation; nevertheless, approximate position independence with narrower tuning curves is still explained by a direction-template mechanism like that described above.

Roughly speaking, learning with a simple Hebb rule tends to maximize the total response by gradient ascent and thus tune the net to the input patterns that frequently occur. Consider the output

O = σ(I) = σ(Σᵢ wᵢIᵢ)

The Hebb rule Δwᵢ ∝ IᵢO is always of the same sign as the gradient of the function E := ½O²,

∂E/∂wᵢ = O ∂O/∂wᵢ = IᵢO σ′(I)
because the derivative σ′ is always positive. As a consequence, there should be a general tendency for local direction selectivity to be aligned with the direction of the stimulus velocity.

Recently, it was demonstrated that although dilation-sensitive MSTd neurons are basically position invariant in their responses, they often respond best to dilations centered at a particular location in the receptive field (often not the receptive field center) (Duffy and Wurtz 1991c). Similar results were obtained in the simulations in Sereno and Sereno (1991) using MT-like (narrower) input-layer tuning curves. It may be advantageous to retain information about combinations of flow field components (here, dilation and translation) in single units, since these combinations can have particular behavioral relevance, for example, in signaling direction of heading (Perrone 1992). More realistic peaked (instead of linear) speed tuning curves (Maunsell and Van Essen 1983) in the MT-like input layer could potentially sharpen the response to particular flow components, since local speeds may be shifted from the optimum as flow field components are added. Cross-direction inhibition (known to occur in MT: Snowden et al. 1991) could also be incorporated, effectively deleting portions of the flow field containing conflicting local motion signals. This could improve performance with more complex, real-world motion arrays.

The rotation, dilation, and contraction velocity fields required for training are readily produced when an animal is moving around in a rigid environment. Exposure to such velocity fields may be crucial in order for a young animal to develop rotation and dilation cells in its visual system. Human babies, for instance, can distinguish a rotation field from a random velocity field only after several months of visual experience (Spitz et al. 1988). This could be tested by recording from MST in infant monkeys.
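Returning to the gradient-ascent view of the Hebb rule above, a minimal numerical check (Python with NumPy; the logistic output function, input values, and step size are illustrative assumptions) confirms that each update Δwᵢ ∝ IᵢO never decreases E = ½O² for fixed inputs:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    # logistic squashing function; any monotonically increasing sigmoid works
    return 1.0 / (1.0 + np.exp(-x))

I_in = rng.normal(size=50)       # fixed inputs I_i to the unit
w = 0.01 * rng.normal(size=50)   # small initial weights
lr = 0.01                        # step size (illustrative)

E_prev = 0.5 * sigmoid(w @ I_in) ** 2
for _ in range(100):
    O = sigmoid(w @ I_in)
    w += lr * I_in * O           # Hebb rule: Delta w_i proportional to I_i * O
    E_now = 0.5 * sigmoid(w @ I_in) ** 2
    assert E_now >= E_prev       # each Hebb step ascends E = O^2 / 2
    E_prev = E_now
```

Because the logistic output is strictly positive, every update moves I = Σᵢ wᵢIᵢ in the direction that increases O, so E increases monotonically, matching the gradient argument above.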
Feedforward networks using Hebb rules have been shown to be capable of producing detectors selective to a hierarchy of features like those found in the successive stages of visual processing: center-surround units like those in the LGN (Linsker 1986a), orientation-selective units like simple cells in V1 (Linsker 1986b), pattern motion units like some cells in MT (Sereno 1989), and finally position-independent rotation and dilation units like cells in dorsal MST (Sereno and Sereno 1991). The visual system may use simple local learning rules and a richly textured environment to build up complex filters in stages. This strategy could drastically reduce the amount of supervision that is required later on (cf. Geman et al. 1992) as the visual system learns to recognize objects and direct navigation and manipulation.
Note Added in Proof. Recently, Gallant et al. (1993) found that neurons in V4 respond selectively, and in a position-invariant way, to static patterns containing concentric, radiating, shearing, or spiral contours. The main outlines of our analysis could be extended to explain the selectivity and development of these neurons by substituting an orientation-selective input layer for the direction-selective input layer considered here.
Acknowledgments
M. E. S. was supported by a postdoctoral fellowship, K. Z. by a graduate fellowship, and M. I. S. by a research award from the McDonnell-Pew Center for Cognitive Neuroscience at San Diego. We thank an anonymous reviewer for helpful comments.
References

Andersen, R., Graziano, M., and Snowden, R. 1991. Selectivity of area MST neurons for expansion/contraction and rotation motions. Invest. Ophthal. Vis. Sci., Abstr. 32, 823.
Duffy, C. J., and Wurtz, R. H. 1991a. Sensitivity of MST neurons to optic flow stimuli. I. A continuum of response selectivity to large-field stimuli. J. Neurophysiol. 65, 1329-1345.
Duffy, C. J., and Wurtz, R. H. 1991b. Sensitivity of MST neurons to optic flow stimuli. II. Mechanisms of response selectivity revealed by small-field stimuli. J. Neurophysiol. 65, 1346-1359.
Duffy, C. J., and Wurtz, R. H. 1991c. MSTd neuronal sensitivity to heading of motion in optic flow fields. Soc. Neurosci. Abstr. 17, 441.
Felleman, D., and Van Essen, D. C. 1991. Distributed hierarchical processing in primate cerebral cortex. Cerebral Cortex 1, 1-47.
Gallant, J. L., Braun, J., and Van Essen, D. C. 1993. Selectivity for polar, hyperbolic, and Cartesian gratings in macaque visual cortex. Science 259, 100-103.
Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Comp. 4, 1-58.
Graziano, M. S. A., Andersen, R. A., and Snowden, R. 1990. Stimulus selectivity of neurons in macaque MST. Soc. Neurosci. Abstr. 16, 7.
Koenderink, J. J. 1986. Optic flow. Vision Res. 26, 161-180.
Koenderink, J. J., and van Doorn, A. J. 1975. Invariant properties of the motion parallax field due to the movement of rigid bodies relative to an observer. Opt. Acta 22, 773-791.
Koenderink, J. J., and van Doorn, A. J. 1976. Local structure of movement parallax of the plane. J. Opt. Soc. Am. 66, 717-723.
Lagae, L., Xiao, D., Raiguel, S., Maes, H., and Orban, G. A. 1991. Position invariance of optic flow component selectivity differentiates monkey MST and FST cells from MT cells. Invest. Ophthal. Vis. Sci., Abstr. 32, 823.
Linsker, R. 1986a. From basic network principles to neural architecture: Emergence of spatial-opponent cells. Proc. Natl. Acad. Sci. U.S.A. 83, 7508-7512.
Linsker, R. 1986b. From basic network principles to neural architecture: Emergence of orientation-selective cells. Proc. Natl. Acad. Sci. U.S.A. 83, 8390-8394.
Longuet-Higgins, H. C., and Prazdny, K. 1980. The interpretation of a moving retinal image. Proc. R. Soc. London B 208, 385-397.
Maunsell, J. H. R., and Van Essen, D. C. 1983. Functional properties of neurons in middle temporal visual area (MT) of the macaque monkey: I. Selectivity for stimulus direction, speed and orientation. J. Neurophysiol. 49, 1127-1147.
Movshon, J. A., Adelson, E. H., Gizzi, M. S., and Newsome, W. T. 1985. Analysis of moving visual patterns. In Pattern Recognition Mechanisms, C. Chagas, R. Gattass, and C. Gross, eds., pp. 117-151. Springer-Verlag, New York.
Orban, G. A., Lagae, L., Verri, A., Raiguel, S., Xiao, D., Maes, H., and Torre, V. 1992. First-order analysis of optical flow in monkey brain. Proc. Natl. Acad. Sci. U.S.A. 89, 2595-2599.
Perrone, J. A. 1992. Model for the computation of self-motion in biological systems. J. Opt. Soc. Am. A 9, 177-194.
Poggio, T., Verri, A., and Torre, V. 1990. Does cortical area MST know Green theorems? Istituto per la Ricerca Scientifica e Tecnologica Tech. Rep. No. 900807.
Poggio, T., Verri, A., and Torre, V. 1991. Green theorems and qualitative properties of the optical flow. MIT A.I. Memo No. 1289, 1-6.
Rodman, H. R., and Albright, T. D. 1987. Coding of visual stimulus velocity in area MT of the macaque. Vision Res. 27, 2035-2048.
Saito, H., Yukie, M., Tanaka, K., Hikosaka, K., Fukada, Y., and Iwai, E. 1986. Integration of direction signals of image motion in the superior temporal sulcus of the macaque monkey. J. Neurosci. 6, 145-157.
Sakata, H., Shibutani, H., Ito, Y., and Tsurugai, K. 1986. Parietal cortical neurons responding to rotary movement of visual stimulus in space. Exp. Brain Res. 61, 658-663.
Sereno, M. I. 1989. Learning the solution to the aperture problem for pattern
motion with a Hebb rule. In Advances in Neural Information Processing Systems 1, D. S. Touretzky, ed., pp. 468-476. Morgan Kaufmann, San Mateo, CA.
Sereno, M. I., and Allman, J. M. 1991. Cortical visual areas in mammals. In The Neural Basis of Visual Function, A. G. Leventhal, ed., pp. 160-172. Macmillan, London.
Sereno, M. I., and Sereno, M. E. 1990. Learning to discriminate senses of rotation and dilation with a Hebb rule. Invest. Ophthal. Vis. Sci., Abstr. 31, 528.
Sereno, M. I., and Sereno, M. E. 1991. Learning to see rotation and dilation with a Hebb rule. In Advances in Neural Information Processing Systems 3, R. P. Lippmann, J. Moody, and D. S. Touretzky, eds., pp. 320-326. Morgan Kaufmann, San Mateo, CA.
Snowden, R. J., Treue, S., Erickson, R. G., and Andersen, R. A. 1991. The response of area MT and V1 neurons to transparent motion. J. Neurosci. 11, 2768-2785.
Spitz, R. V., Stiles-Davis, J., and Siegel, R. M. 1988. Infant perception of rotation from rigid structure-from-motion displays. Soc. Neurosci. Abstr. 14, 1244.
Tanaka, K., and Saito, H.-A. 1989. Analysis of motion of the visual field by direction, expansion/contraction, and rotation cells clustered in the dorsal part of the medial superior temporal area of the macaque monkey. J. Neurophysiol. 62, 626-641.
Tanaka, K., Hikosaka, K., Saito, H.-A., Yukie, M., Fukada, Y., and Iwai, E. 1986. Analysis of local and wide-field movements in the superior temporal visual areas of the macaque monkey. J. Neurosci. 6, 134-144.
Tanaka, K., Fukada, Y., and Saito, H.-A. 1989. Underlying mechanisms of the response specificity of expansion/contraction and rotation cells in the dorsal part of the medial superior temporal area of the macaque monkey. J. Neurophysiol. 62, 642-656.

Received 11 October 1991; accepted 11 January 1993.
Communicated by Andrew Barto
Improving Generalization for Temporal Difference Learning: The Successor Representation

Peter Dayan
Computational Neurobiology Laboratory, The Salk Institute, P.O. Box 85800, San Diego, CA 92186-5800 USA

Estimation of returns over time, the focus of temporal difference (TD) algorithms, imposes particular constraints on good function approximators or representations. Appropriate generalization between states is determined by how similar their successors are, and representations should follow suit. This paper shows how TD machinery can be used to learn such representations, and illustrates, using a navigation task, the appropriately distributed nature of the result.

1 Introduction
The method of temporal differences (TD; Samuel 1959; Sutton 1984, 1988) is a way of estimating future outcomes in problems whose temporal structure is paramount. A paradigmatic example is predicting the long-term discounted value of executing a particular policy in a finite Markovian decision task. The information gathered by TD can be used to improve policies in a form of asynchronous dynamic programming (DP; Watkins 1989; Barto et al. 1989; Barto et al. 1991). As briefly reviewed in the next section, TD methods apply to a learning framework, which specifies the goal for learning and precisely how the system fails to attain this goal in particular circumstances. Just like a proposal to minimize mean square error, TD methods lie at the heart of different mechanisms operating over diverse representations.

Representation is key: difficult problems can be rendered trivial if looked at in the correct way. It is particularly important for systems to be able to learn appropriate representations, since it is rarely obvious from the outset exactly what they should be. For static tasks, generalization is typically sought by awarding similar representations to states that are nearby in some space. This concept extends to tasks involving prediction over time, except that adjacency is defined in terms of similarity of the future course of the behavior of a dynamic system. Section 3 suggests a way, based on this notion of adjacency, of learning representations that should be particularly appropriate for problems to which TD techniques have been applied. Learning these representations can be viewed as a task itself amenable to TD methods, and so requires no extra machinery. Section 4 shows the nature of the resulting representation for a simple navigation task. Part of this work was reported in Dayan (1991a,b).

Neural Computation 5, 613-624 (1993)
© 1993 Massachusetts Institute of Technology

2 TD Learning
Consider the problem of estimating expected terminal rewards, or returns, in a finite absorbing Markov chain; this was studied in the context of TD methods by Sutton (1988). An agent makes a transition between nonabsorbing states i and j ∈ N according to the ijth element of the Markov matrix Q, or to absorbing state k ∈ T with probability s_ik, with a stochastic reinforcement or return whose mean is r̄_k and whose variance is finite. In this and the next section, the returns and transition probabilities are assumed to be fixed. The immediate expected return from state i ∈ N, represented as the ith element of a vector h, is the sum of the probabilities of making immediate transitions to absorbing states times the expected returns from those states:

[h]_i = Σ_{k ∈ T} s_ik r̄_k
The overall expected returns, taking account of the possibility of making transitions to nonabsorbing states first, are

[r̄]_i = [h]_i + [Qh]_i + [Q^2 h]_i + ... = [(I - Q)^{-1} h]_i    (2.1)
where I is the identity matrix. The agent estimates the overall expected return from each state (compiled into a vector r̄) with a vector-valued function r̂(w), which depends on a set of parameters w whose values are determined during the course of learning. If the agent makes the transition from state i_t to i_{t+1} in one observed sequence, TD(0) specifies that w should be changed to reduce the error

ε_{t+1} = [r̂(w)]_{i_{t+1}} - [r̂(w)]_{i_t}    (2.2)

where, for convenience, [r̂(w)]_{i_{t+1}} is taken to be the delivered return r_{i_{t+1}} if i_{t+1} is absorbing. This enforces a kind of consistency in the estimates of the overall returns from successive states, which is the whole basis of TD learning. More generally, information about the estimates from later states [r̂(w)]_{i_{t+s}} for s > 1 can also be used, and Sutton (1988) defined the TD(λ) algorithm, which weighs their contributions exponentially less according to λ^s. With the TD algorithm specifying how the estimates should be manipulated in the light of experience, the remaining task is one of function approximation. How w should change to minimize the error ε_{t+1}
in equation 2.2 depends on exactly how w determines [r̂(w)]_{i_t}. Sutton (1988) represented the nonabsorbing states with real-valued vectors {x_i}, [r̂(w)]_i as the dot product w · x_i of the state vector with w taken as a vector of weights, and changed w in proportion to

(w · x_{i_{t+1}} - w · x_{i_t}) x_{i_t}

using r_{i_{t+1}} instead of w · x_{i_{t+1}} if i_{t+1} is absorbing. This is that part of the gradient -∇_w ε²_{t+1} that comes from the error at step x_{i_t}, ignoring the contribution from x_{i_{t+1}} (Werbos 1990; Dayan 1992). In the "batch-learning" case for which the weights are updated only after absorption, Sutton showed that if the learning rate is sufficiently small and the vectors representing the states are linearly independent, then the expected values of the estimates converge appropriately. Dayan (1992) extended this proof to show the same was true of TD(λ) for 0 < λ < 1.

3 Time-Based Representations
One of the key problems with TD estimation, and equivalently with TD-based control (Barto et al. 1989), is the speed of learning. Choosing a good method of function approximation, which amounts in the linear case to choosing good representations for the states, should make a substantial difference. For prediction problems such as the one above, the estimated expected overall return of one state is a biased sum of the estimated expected overall returns of its potential successors. This implies that for approximation schemes that are linear in the weights w, a good representation for a state would be one that resembles the representations of its successors, being only a small Euclidean distance away from them (with the degrees of resemblance being determined by the biases). In this way, the estimated value of each state can be partially based on the estimated values of those that succeed it, in a way made more formal below.

For conventional, static, problems, received wisdom holds that distributed representations perform best, so long as the nature of the distribution somehow conforms with the task: nearby points have nearby solutions. The argument above suggests that the same is true for dynamic tasks, except that neighborliness is defined in terms of temporal succession. If the transition matrix of the chain is initially unknown, this representation will have to be learned directly through experience. Starting at state i ∈ N, imagine trying to predict the expected future occupancy of all other states. For the jth state, j ∈ N, this should be

[x_i]_j = [I]_ij + [Q]_ij + [Q^2]_ij + ... = [(I - Q)^{-1}]_ij    (3.1)

where [M]_ij is the ijth element of matrix M and I is the identity matrix. Representing state i using x_i is called the successor representation (SR).
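Equation 3.1 can be evaluated directly for a toy chain. The three-state absorbing chain below is an illustrative assumption, not taken from the paper:

```python
import numpy as np

# Successor representation of eq. 3.1: row i of (I - Q)^{-1} gives the
# expected future occupancy of every state j, starting from state i.
# The absorbing chain below is an illustrative assumption.
Q = np.array([[0.0, 1.0, 0.0],    # state 0 always moves to state 1
              [0.0, 0.0, 0.5],    # state 1: to state 2 or absorb
              [0.0, 0.0, 0.0]])   # state 2 always absorbs
X_sr = np.linalg.inv(np.eye(3) - Q)
print(X_sr[0])   # occupancies from state 0: 1, 1, 0.5
```

State 2 is half as likely to be occupied from state 0 as state 1 is, because the chain may absorb from state 1 first; the matrix inverse captures exactly this.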
A TD algorithm itself is one way of learning the SR. Consider a punctate representation that devotes one dimension to each state and has the lth element of the vector representing state k, [x_k]_l, equal to [I]_kl. Starting from i_t = i, the prediction of how often [x_{i_s}]_j = 1 for s ≥ t is exactly the prediction of how often the agent will visit state j in the future starting from state i, and should correctly be [x_i]_j. To learn this, the future values of [x_{i_s}]_j for s ≥ t can be used in just the same way that the future delivery of reinforcement or return is used in standard TD learning. For a linear function approximator, it turns out that the SR makes easy the resulting problem of setting the optimal weights w*, which are defined as those making r̄ = r̂(w*). If X is the matrix of vectors representing the states in the SR, [X]_ij = [x_j]_i, then w* is determined as
X^T w* = r̄

which implies, from equations 2.1 and 3.1, that

w* = h
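This can be checked numerically. The chain and the immediate-return vector h below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

# Check that with the SR the optimal weights are the immediate
# expected returns: X^T w = r_bar is solved by w = h.
Q = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])   # nonabsorbing-state transition matrix
h = np.array([0.0, 0.4, 0.7])    # immediate expected returns (assumed)
r_bar = np.linalg.inv(np.eye(3) - Q) @ h   # eq. 2.1: overall returns
X = np.linalg.inv(np.eye(3) - Q).T         # SR matrix, [X]_ij = [x_j]_i
print(np.allclose(X.T @ h, r_bar))         # True
```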
But h is just the expected immediate return from each state; it is insensitive to all the temporal dependencies that result from transitions to nonabsorbing states. The SR therefore effectively factors out the entire temporal component of the task, leaving a straightforward estimation problem for which TD methods would not be required. This can be seen in the way that the transition matrix Q disappears from the update equation, just as would happen for a nontemporal task without a transition matrix at all. For instance, for the case of an absorbing Markov chain with batch-learning updates, Sutton showed that the TD(0) update equation for the mean value of the weights W_n satisfies

W_{n+1} = W_n + αXD(h + QX^T W_n - X^T W_n)

where X is the representation, α is the learning rate, and, since the updates are made after observing a whole sequence of transitions from start to absorption rather than just a single one, D is the diagonal matrix whose diagonal elements are the average number of times each state is visited on each sequence. Alternatively, directly from the estimates of the values of the states,

(X^T W_{n+1} - r̄) = [I - αX^T XD(I - Q)](X^T W_n - r̄)

Using the SR X instead, the update becomes

W_{n+1} = W_n + αXD(h - W_n)

or

(W_{n+1} - h) = (I - αXD)(W_n - h)
Since X is invertible, Sutton's proof that X^T W_n → r̄, and therefore that W_n → h as n → ∞, still holds. I conjecture that the variance of these estimates will be lower than those for other representations X (e.g., X = I) because of the exclusion of the temporal component. For control problems it is often convenient to weigh future returns exponentially less according to how late they arrive; this effectively employs a discount factor. In this case the occupancy of future states in equation 3.1 should be weighed exponentially less by exactly the same amount.

A possible objection to using TD learning for the SR is that it turns the original temporal learning problem (that of predicting future reinforcement) into a whole set of temporal learning problems (those of predicting the future occupancy of all the states). This objection is weakened in two cases:

- The learned predictions can be used merely to augment a standard representation such as the punctate one. An approximately appropriate representation can be advantageous even before all the predictions are quite accurate. Unfortunately this case is hard to analyze because of the interaction between the learning of the predictions and the learning of the returns. Such a system is used in the navigation example below.

- The agent could be allowed to learn the predictions by exploring its environment before it is first rewarded or punished. This can be viewed as a form of latent learning and works because the representation does not depend on the returns.
One could regard these predictions as analogous to the hidden representations in Anderson's (1986) multilayer backpropagation TD network in that they are fashioned to be appropriate for learning TD predictions but are not directly observable and so have to be learned. Whereas Anderson's scheme uses a completely general technique that makes no explicit reference to states' successors, the SR is based precisely on what should comprise a good representation for temporal tasks.

4 Navigation Illustration
Learning the shortest paths to a goal in a maze such as the one in Figure 1 was chosen by Watkins (1989) and Barto et al. (1989) as a good example of how TD control works. For a given policy, that is, mapping from positions in the grid to directions of motion, a TD algorithm is used to estimate the distance of each state from the goal. The agent is provided with a return of -1 for every step that does not take it to the goal, and future returns, that is, future steps, are weighed exponentially less using a discount factor. The policy is improved in an asynchronous form of dynamic programming's policy iteration by making more likely those actions whose consequences are better than expected.

Figure 1: The grid task. The agent can move one step in any of the four directions except where limited by the barrier or by the walls.

Issues of representation are made particularly clear in such a simple example. For the punctate case, there can be no generalization between states. Distributed representations can perform better, but there are different methods with different qualities. Watkins (1989), for a similar task, used a representation inspired by Albus's CMAC (1975). In this case, CMAC squares which cover patches of 3 x 3 grid points are placed regularly over the grid such that each interior grid point is included in 9 squares. The output of the units corresponding to the squares is 0 if the agent is outside their receptive fields, and otherwise, like a radial basis function, is modulated by the distance of the agent from the center of the relevant square. Over most of the maze this is an excellent representation: locations that are close in the Manhattan metric on the grid are generally similar distances from the goal, and are also covered by many of the same CMAC squares. Near the barrier, however, the distribution of the CMACs actually hinders learning: locations close in the grid but on opposite sides of the barrier are very different distances from the goal, and yet still share a similar CMAC square representation.

By contrast, the successor representation, which was developed in the previous section, produces a CMAC-like representation that adapts correctly to the barrier. If the agent explores the maze with a completely random policy before being forced to find the goal, the learned SR would closely resemble the example shown in Figure 2. Rather like a CMAC square, the representation decays exponentially away from the starting state (5,6) in a spatially ordered fashion; note, however, the SR's recognition
that states on the distant side of the barrier are actually very far away in terms of the task (and so the predictions are too small to be visible).

Figure 2: The predictions of future occupancy starting from (5,6) after exploration in the absence of the goal. The z-coordinate shows the (normalized) predictions, and the barrier and the goal are overlaid. The predictions decay away exponentially from the starting location, except across the barrier.

Simulations confirm that using the SR in conjunction with a punctate representation leads to faster learning for this simple task (see Fig. 3), even if the agent does not have the chance to explore the maze before being forced to find the goal.

This example actually violates the stationarity assumption made in Section 2 that transition probabilities and returns are fixed. As the agent improves its policy, the mean number of steps it takes to go from one state to another changes, and so the SR should change too. Once the agent moves consistently along the optimal path to the goal, locations that are not on it are never visited, and so the prediction of future occupancy of those should be 0. Figure 4 shows the difference between the final and initial sets of predictions of future occupancy starting from the same location (5,6) as before. The exponential decay along the path is caused by the discount factor, and the path taken by the agent is clear. If the task for the agent were changed such that it had to move from anywhere
on the grid to a different goal location, this new form of the SR would actually hinder the course of learning, since its distributed character no longer correctly reflects the actual nature of the space. This demise is a function of the linked estimation and control, and would not be true for pure estimation tasks.

Figure 3: Learning curves comparing the punctate representation (R_punctate), CMAC squares (R_CMAC), and a punctate representation augmented with the SR (R_SR), in the latter case both with and without an initial, unrewarded, latent learning phase. TD control learning as in Barto et al. (1989) is temporarily switched off after the number of trials shown on the x-axis, and the y-axis shows the average number of excess steps the agent makes on the way to the goal starting from every location in the grid. Parameters are in Dayan (1991b).

5 Discussion
This paper has considered some characteristics of how representation determines the performance of TD learning in simple Markovian environments. It suggests that what amounts to a local kernel for the Markov
chain is an appropriate distributed representation, because it captures all the necessary temporal dependencies. This representation can be constructed during a period of latent learning and is shown to be superior in a simple navigation task, even over others that also share information between similar locations.

Figure 4: The degradation of the predictions. Both graphs show the differences between the predictions after 2000 steps and those initially: the top graph as a surface, with the barrier and the goal overlaid, and the bottom graph as a density plot. That the final predictions just give the path to the goal is particularly clear from the white (positive) area of the density plot; the black (negative) area delineates those positions on the grid that are close to the start point (5,6), and therefore featured in the initial predictions, but are not part of this ultimate path.

Designing appropriate representations is a key issue for many of the sophisticated learning control systems that have recently been proposed. However, as Barto et al. (1991) pointed out, a major concern is that the proofs of convergence of TD learning have not been very extensively generalized to different approximation methods. Both Moore (1990) and Chapman and Kaelbling (1991) sought to exorcise the daemon of dimensionality by using better function approximation schemes, which is an equivalent step to using a simple linear scheme with more sophisticated input representations. Moore used kd trees (see Omohundro 1987, for an excellent review), which have the added advantage of preserving the integrity of the actual values they are required to store, and so preserve the proofs of the convergence of Q-learning (Barto et al. 1991; Watkins and Dayan 1992). However, just like the CMAC representation described above, the quality of the resulting representation depends on an a priori metric, and so is not malleable to the task. Chapman and Kaelbling also used a tree-like representation for Q-learning, but their trees were based on logical formulas satisfied by their binary-valued input variables. If these variables do not have the appropriate characteristics, the resulting representation can turn out to be unhelpful. It would probably not afford great advantage in the present case. Sutton (1990), Thrun et al. (1991), and others have suggested the utility of learning the complete transition matrix of the Markov chain, or, for the case of control, the mapping from states and actions to next states.
Sutton used this information to allow the agent to learn while it is disconnected from the world. Thrun, Möller, and Linden used it implicitly to calculate the cost of, and then improve, a projected sequence of actions. The SR is less powerful in the sense that it provides only an appropriately distributed representation and not a veridical map of the world. A real map has the added advantage that its information is independent of the goals and policies of the agent; however, it is more difficult to learn. Sutton's scheme could equally well be used to improve a system based on the learned representation.

Sutton and Pinette (1985) discussed a method for control in Markovian domains that is closely related to the SR and that uses the complete transition matrix implicitly defined by a policy. In the notation of this paper, they considered a recurrent network effectively implementing the iterative scheme

x_{n+1} = x_i + Q^T x_n

where x_i is the punctate representation of the current state i and Q is the
transition matrix. x_n converges to x_i from equation 3.1, the SR of state i. Rather than use this for representational purposes, however, Sutton and Pinette augmented Q so that the sum of future returns is directly predicted through this iterative process. This can be seen as an alternative method of eliminating the temporal component of the task, although the use of the recurrence implies that the final predictions are very sensitive to errors in the estimate of Q. The augmented Q matrix is learned using the discrepancies between the predictions at adjacent time steps; however, the iterative scheme complicates the analysis of the convergence of this learning algorithm. A particular advantage of their method is that a small change in the model (e.g., a slight extension to the barrier) can instantaneously lead to dramatic changes in the predictions. Correcting the SR would require relearning all the affected predictions explicitly.

Issues of representation and function approximation are just as key for sophisticated as for unsophisticated navigation schemes. Having a representation that can learn to conform to the structure of a task has been shown to offer advantages, but any loss of the guarantee of convergence of the approximation and dynamic programming methods is, of course, a significant concern.
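As a numerical aside, the fixed point of the Sutton and Pinette iteration discussed above can be verified directly; the three-state chain below is an illustrative assumption, not taken from the paper:

```python
import numpy as np

# Iterate x_{n+1} = x_i + Q^T x_n from the punctate vector for state 0;
# with a substochastic Q the iterates converge to the SR of eq. 3.1.
Q = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
e0 = np.array([1.0, 0.0, 0.0])       # punctate representation of state 0
x = e0.copy()
for _ in range(50):
    x = e0 + Q.T @ x
sr0 = np.linalg.inv(np.eye(3) - Q)[0]   # [x_0]_j = [(I - Q)^{-1}]_0j
print(np.allclose(x, sr0))              # True
```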
Acknowledgments
I am very grateful to Read Montague, Steve Nowlan, Rich Sutton, Terry Sejnowski, Chris Watkins, David Willshaw, the connectionist groups at Edinburgh and Amherst, and the large number of people who read drafts of my thesis for their help and comments. I am especially grateful to Andy Barto for his extensive and detailed criticism and for pointers to relevant literature. Support was from the SERC.
References

Albus, J. S. 1975. A new approach to manipulator control: The Cerebellar Model Articulation Controller (CMAC). Transact. ASME: J. Dynam. Syst. Measure. Control 97, 220-227.
Anderson, C. W. 1986. Learning and problem solving with multilayer connectionist systems. Ph.D. Thesis, University of Massachusetts, Amherst, MA.
Barto, A. G., Sutton, R. S., and Watkins, C. J. C. H. 1989. Learning and sequential decision making. Tech. Rep. 89-95, Computer and Information Science, University of Massachusetts, Amherst, MA.
Barto, A. G., Bradtke, S. J., and Singh, S. P. 1991. Real-time learning and control using asynchronous dynamic programming. Tech. Rep. 91-57, Department of Computer Science, University of Massachusetts, Amherst, MA.
Chapman, D., and Kaelbling, L. P. 1991. Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. Proceedings of the 1991 International Joint Conference on Artificial Intelligence, 726-731.
Dayan, P. 1991a. Navigating through temporal difference. In Advances in Neural Information Processing Systems, Vol. 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 464-470. Morgan Kaufmann, San Mateo, CA.
Dayan, P. 1991b. Reinforcing connectionism: Learning the statistical way. Ph.D. Thesis, University of Edinburgh, Scotland.
Dayan, P. 1992. The convergence of TD(λ) for general λ. Machine Learn. 8, 341-362.
Moore, A. W. 1990. Efficient memory-based learning for robot control. Ph.D. Thesis, University of Cambridge Computer Laboratory, Cambridge, England.
Omohundro, S. 1987. Efficient algorithms with neural network behaviour. Complex Syst. 1, 273-347.
Samuel, A. L. 1959. Some studies in machine learning using the game of checkers. Reprinted in Computers and Thought, E. A. Feigenbaum and J. Feldman, eds. McGraw-Hill, New York, 1963.
Sutton, R. S. 1984. Temporal credit assignment in reinforcement learning. Ph.D. Thesis, University of Massachusetts, Amherst, MA.
Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learn. 3, 9-44.
Sutton, R. S. 1990. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning. Morgan Kaufmann, San Mateo, CA.
Sutton, R. S., and Pinette, B. 1985. The learning of world models by connectionist networks. In Proceedings of the Seventh Annual Conference of the Cognitive Science Society, pp. 54-64. Lawrence Erlbaum, Irvine, CA.
Thrun, S. B., Möller, K., and Linden, A. 1991. Active exploration in dynamic environments. In Advances in Neural Information Processing Systems, Vol. 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds., pp. 450-456. Morgan Kaufmann, San Mateo, CA.
Watkins, C. J. C. H. 1989. Learning from delayed rewards. Ph.D. Thesis, University of Cambridge, England.
Watkins, C. J. C. H., and Dayan, P. 1992. Q-learning. Machine Learn. 8, 279-292.
Werbos, P. J. 1990. Consistency of HDP applied to a simple reinforcement learning problem. Neural Networks 3, 179-189.

Received 20 January 1992; accepted 20 October 1992.
Communicated by Geoffrey Hinton
Discovering Predictable Classifications

Jürgen Schmidhuber*
Department of Computer Science, University of Colorado, Boulder, CO 80309, USA

Daniel Prelinger
Institut für Informatik, Technische Universität München, Arcisstrasse 21, 8000 München 2, Germany
Prediction problems are among the most common learning problems for neural networks (e.g., in the context of time series prediction, control, etc.). With many such problems, however, perfect prediction is inherently impossible. For such cases we present novel unsupervised systems that learn to classify patterns such that the classifications are predictable while still being as specific as possible. The approach can be related to the IMAX method of Becker and Hinton (1989) and Zemel and Hinton (1991). Experiments include a binary stereo task proposed by Becker and Hinton, which can be solved more readily by our system.
1 Motivation and Basic Approach
Many neural net systems (e.g., for control, time series prediction) rely on adaptive submodules for learning to predict patterns from other patterns. Perfect prediction, however, is often inherently impossible. In this paper we study the problem of finding pattern classifications such that the classes are predictable, while still being as specific as possible. To grasp the basic idea, let us discuss several examples.

Example 1: Hearing the first two words of a sentence "Henrietta eats ..." allows you to infer that the third word probably indicates something to eat, but you cannot tell what. The class of the third word is predictable from the previous words; the particular instance of the class is not. The class "food" is not only predictable but also nontrivial and specific in the sense that it does not include everything: "John," for instance, is not an instance of "food."

*Current address: Institut für Informatik, Technische Universität München, Arcisstrasse 21, 8000 München 2, Germany.
Neural Computation 5, 625-635 (1993) © 1993 Massachusetts Institute of Technology
The problem is to classify patterns from a set of training examples such that the classes are both predictable and not too general. A general solution to this problem would be useful for discovering higher level structure in sentences generated by unknown grammars, for instance. Another application would be the unsupervised classification of different pattern instances belonging to the same class, as will be seen in the next example.
Example 2 (stereo task; due to Becker and Hinton 1989): There are two binary images called the "left" image and the "right" image. Each image consists of two "strips," each strip being a binary vector. The right image is purely random. The left image is generated from the right image by choosing, at random, a single global shift to be applied to each strip of the right image. An input pattern is generated by concatenating a strip from the right image with the corresponding strip from the left image. "So the input can be interpreted as a fronto-parallel surface at an integer depth. The only local property that is invariant across space is the depth (i.e. shift)" (Becker and Hinton 1989). With a given pair of different input patterns, the task is to extract a nontrivial classification of whatever is common to both patterns, which happens to be the stereoscopic shift.

Example 1 is an instance of the so-called asymmetric case: There we are interested in a predictable nontrivial classification of one pattern (the third word), given some other patterns (the previous words). Example 2 is an instance of the so-called symmetric case: There we are interested in the nontrivial common properties of two patterns from the same class.

In its simplest form, our basic approach to unsupervised discovery of predictable classifications is based on two neural networks called T1 and T2. Both can be implemented as standard backpropagation networks (Werbos 1974; LeCun 1985; Parker 1985; Rumelhart et al. 1986). With a given pair of input patterns, T1 sees the first pattern and T2 sees the second pattern. Let us first focus on the asymmetric case. For instance, with example 1 above T1 may see a representation of the words "Henrietta eats," while T2 may see a representation of the word "vegetables." T2's task is to classify its input. T1's task is not to predict T2's raw environmental input but to predict T2's output instead. Both networks have q output units. Let p ∈ {1, ..., m} index the input patterns. T2 produces as an output the classification y^{p,2} ∈ [0, ..., 1]^q in response to an input vector x^{p,2}. T1's output in response to its input vector x^{p,1} is the prediction y^{p,1} ∈ [0, ..., 1]^q of the current classification y^{p,2} emitted by T2. We have two conflicting goals which in general are not simultaneously satisfiable: (1) All predictions y^{p,1} should match the corresponding classifications y^{p,2}. (2) The y^{p,2} should be discriminative: different inputs x^{p,2} should lead to different classifications y^{p,2}. We express the trade-off between (1) and (2) by means of two opposing costs.
(1) is expressed by an error term M (for "Match"):

M = Σ_{p=1}^{m} ||y^{p,1} - y^{p,2}||²    (1.1)
Here llzlll denotes the euclidean norm. (2) is enforced by an additional error term 0 2 (for "Discrimination") to be minimized by T2 only. 0 2 will be designed to encourage significant euclidean distance between classifications of different input patterns. 02 can be defined in more than one reasonable way. The next section will list four alternative possibilities with mutual advantages and disadvantages. These alternatives include (1) a novel method for constrained variance maximization, (2) autoencoders, and (3) a recent technique called "predictability minimization" (Schmidhuber 1992). The total error to be minimized by T2 is EM+ (1 - E)D2
(1.2)
where 0 < ε < 1 determines the relative weighting of the opposing error terms. In the asymmetric case, the total error to be minimized by T1 is just

    εM   (1.3)
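The asymmetric-case objectives can be sketched numerically as follows. This is a minimal numpy sketch: the function names and toy arrays are illustrative, and D2 is assumed to be supplied as a number by one of the methods of Section 2.

```python
import numpy as np

def match_error(Y1, Y2):
    """M = sum_p ||y^{p,1} - y^{p,2}||^2 (equation 1.1)."""
    return np.sum((Y1 - Y2) ** 2)

def total_errors(Y1, Y2, D2, eps=0.5):
    """Asymmetric case: T1 minimizes eps*M (equation 1.3);
    T2 minimizes eps*M + (1 - eps)*D2 (equation 1.2)."""
    M = match_error(Y1, Y2)
    return eps * M, eps * M + (1 - eps) * D2

# toy check: m = 2 patterns, q = 3 output units
Y1 = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])  # predictions of T1
Y2 = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # classifications of T2
e1, e2 = total_errors(Y1, Y2, D2=0.0, eps=0.5)     # M = 2, so e1 = e2 = 1.0
```

Note how a nonzero D2 changes only T2's objective, reflecting the division of labor between the two networks.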
The error functions are minimized by gradient descent. This forces the predictions and classifications to be more like each other, while at the same time forcing the classifications not to be too general but to tell something about the current input. The procedure is unsupervised in the sense that no teacher is required to tell T2 how to classify its inputs. In the symmetric case (see example 2 above), both T1 and T2 are naturally treated in a symmetric manner. They share the goal of uniquely representing as many of their input patterns as possible, under the constraint of emitting equal (and therefore mutually predictable) classifications in response to a pair of input patterns. Such classifications represent whatever abstract properties are common to both patterns of a typical pair. For handling such symmetric tasks in a natural manner, we only slightly modify T1's error function for the asymmetric case, by introducing an extra "discriminating" error term D1 for T1. Now both T_l, l = 1, 2 minimize

    εM + (1 − ε)D_l
(1.4)
where alternative possibilities for defining the D_l will be given in the next section. Figure 1 shows a system based on equation 1.4 and a particular implementation of D_l (to be explained in Section 2.4). The assumption behind our basic approach is that a prediction that closely (in the euclidean sense) matches the corresponding classification is a nearly accurate prediction. Likewise, two very similar (in the euclidean
Jürgen Schmidhuber and Daniel Prelinger
Figure 1: Two networks try to transform their different inputs to obtain the same representation. Each network is encouraged to tell something about its input by means of the recent technique of "predictability minimization." This technique requires additional intrarepresentational predictors (8 of them shown above) for detecting redundancies among the output units of the networks. Alternatives are provided in the text.
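As a concrete illustration of the technique named in the caption (defined formally in Section 2.4), the discrimination term can be sketched as follows. This is a hedged sketch under our reading of that section: the predictor outputs s are simply given here, whereas in the full method each predictor is a trained network.

```python
import numpy as np

def pm_discrimination_term(y, s):
    """D = -(1/2) * sum_i (s_i - y_i)^2: the classifier minimizes D,
    i.e., it maximizes the predictors' squared error."""
    y, s = np.asarray(y), np.asarray(s)
    return -0.5 * np.sum((s - y) ** 2)

# If the predictors reproduce the code exactly, D is at its worst (0.0);
# units carrying mutually unpredictable information drive D negative.
redundant = pm_discrimination_term([1.0, 1.0], [1.0, 1.0])    # 0.0
independent = pm_discrimination_term([1.0, 0.0], [0.5, 0.5])  # -0.25
```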
sense) classifications emitted by a particular network are assumed to have very similar "meaning." It should be mentioned that in theory, even the slightest differences between classifications of different patterns are sufficient to convey all (maximal) Shannon information about the patterns (assuming noise-free data). But then close matches between predictions and classifications could not necessarily be interpreted as accurate predictions. The alternative designs of D_l (to be described below), however, will have the tendency to emphasize differences between different classifications by increasing the euclidean distance between them (sometimes under certain constraints, see Section 2). There is another reason why this is a reasonable thing to do: in a typical application, a classifier will function as a preprocessor for some higher level network. We usually do not want higher level input representations with different "meaning" to be separated by tiny euclidean distances.
Weight sharing. If both T1 and T2 are supposed to provide the same outputs in response to the same inputs (this holds for the stereo task but does not hold in the general case), then we need only one set of weights
for both classifiers. This reduces the number of free parameters (and may improve generalization performance).
Outline. The current section motivated and explained our basic approach. Section 2 presents various instances of the basic approach (based on various possibilities for defining D_l). Section 3 mentions previous related work. Section 4 presents illustrative experiments and experimentally demonstrates advantages of our approach.

2 Alternative Definitions of D_l
This section lists four different approaches for defining D_l, the term which enforces nontrivial discriminative classifications. Section 2.1 presents a novel method that encourages locally represented classes (like with winner-take-all networks). The advantage of this method is that the class representations are orthogonal to each other and easy to understand; its disadvantage is the low representation capacity. In contrast, the remaining methods can generate distributed class representations. Section 2.2 defines D_l with the help of autoencoders. One advantage of this straightforward method is that it is easy to implement. A disadvantage is that predictable information conveyed by some input pattern does not necessarily help to minimize the reconstruction error of an autoencoder (this holds for the stereo task, for instance). Section 2.3 mentions the Infomax approach for defining D_l and explains why we do not pursue this approach. Section 2.4 finally defines D_l by the recent method of predictability minimization (Schmidhuber 1992). An advantage of this method is its potential for creating distributed class representations with statistically independent components.

2.1 Maximizing Constrained Output Variance. We write

    D_l = −Σ_p Σ_i (y_i^{p,l} − ȳ_i^l)² + λ Σ_i (ȳ_i^l)²   (2.1)

and minimize D_l subject to the constraint, for all p:

    Σ_i y_i^{p,l} = 1   (2.2)
Here, as well as throughout the remainder of this paper, subscripts of symbols denoting vectors denote vector components: v_i denotes the ith element of some vector v. λ is a positive constant, and ȳ_i^l denotes the mean of the ith output unit of T_l. It is possible to show that the first term on the right-hand side of equation 2.1 is maximized subject to equation 2.2 if each input pattern is locally represented (just like with winner-take-all networks) by exactly one corner of the q-dimensional hypercube spanned by the possible output vectors, if there are sufficient output units
(Prelinger 1992).¹ Maximizing the second negative term encourages each local class representation to become active in response to only 1/q-th of all possible input patterns. Constraint 2.2 is enforced by setting

    y_i^{p,l} = v_i^{p,l} / Σ_k v_k^{p,l}
where v^{p,l} is the activation vector (in response to x^{p,l}) of a q-dimensional layer of hidden units of T_l, which can be considered as its unnormalized output layer. This novel method is easy to implement; it achieves an effect similar to the one of the recent entropy-based method by Bridle and MacKay (1992).

2.2 Autoencoders. With pattern p and classifier T_l, a reconstructor module A_l (another backpropagation network) receives y^{p,l} as an input. The combination of T_l and A_l functions as an autoencoder. The autoencoder is trained to emit the reconstruction x̂^{p,l} of T_l's external input x^{p,l}, thus forcing y^{p,l} to tell something about x^{p,l}. D_l is defined as

    D_l = Σ_p ||x̂^{p,l} − x^{p,l}||²   (2.3)
2.3 Infomax. Following Linsker's Infomax approach (Linsker 1988), we might think of defining −D_l explicitly as the mutual information between the inputs and the outputs of T_l. We did not use Infomax methods in our experiments for the following reasons: (1) There is no efficient and general method for maximizing mutual information. (2) With our basic approach from Section 1, Infomax makes sense only in situations where it automatically enforces high variance of the outputs of the T_l (possibly under certain constraints). This holds for the simplifying gaussian noise models studied by Linsker, but it does not hold for the general case. (3) Even under appropriate gaussian assumptions, with more than one-dimensional representations, Infomax implies maximization of functions of the determinant DET of the covariance matrix of the output activations (Shannon 1948). In a small application, Linsker explicitly calculated DET's derivatives. In general, however, this is clumsy.

¹Simply maximizing the variance of the output units without obeying constraint 2.2 will not necessarily maximize the number of different classifications. Example: Consider a set of four different four-dimensional input patterns 1000, 0100, 0010, 0001. Suppose the classifier maps the first two input patterns to the four-dimensional output pattern 1100 and the other two to 0011. This will yield a variance of 4. A "more discriminative" response would map each pattern to itself, but this will yield a lower variance of 3.
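The footnote's arithmetic can be checked directly. In this sketch, summed_variance is a hypothetical helper computing the total squared deviation of each output unit from its mean, summed over patterns and units:

```python
import numpy as np

def summed_variance(Y):
    """sum_i sum_p (y_i^p - mean_i)^2 for output matrix Y (patterns x units)."""
    return float(np.sum((Y - Y.mean(axis=0)) ** 2))

# two input patterns mapped to 1100, the other two to 0011
coarse = np.array([[1, 1, 0, 0], [1, 1, 0, 0],
                   [0, 0, 1, 1], [0, 0, 1, 1]], dtype=float)
# each of the four patterns mapped to itself ("more discriminative")
identity = np.eye(4)

v_coarse = summed_variance(coarse)      # 4.0, as the footnote states
v_identity = summed_variance(identity)  # 3.0
```

This confirms the footnote's point: the less discriminative mapping has the larger unconstrained variance.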
2.4 Predictability Minimization. Schmidhuber (1992) shows how D_l can be defined with the help of intrarepresentational adaptive predictors that try to predict each output unit of some T_l from its remaining output units, while each output unit in turn tries to extract properties of the environment that allow it to escape predictability. This was called the principle of predictability minimization. This principle encourages each output unit of T_l to represent environmental properties that are statistically independent from environmental properties represented by the remaining output units. The procedure aims at generating binary "factorial codes" (Barlow et al. 1989). It is our preferred method, because [unlike the methods used by Linsker (1988), Becker and Hinton (1989), and Zemel and Hinton (1991)] it has a potential for removing even nonlinear statistical dependencies² among the output units of some classifier. Let us define

    D_l = −(1/2) Σ_i (s_i^{p,l} − y_i^{p,l})²   (2.4)

where the s_i^{p,l} are the outputs of S_i^l, the ith additional so-called intrarepresentational predictor network of T_l (one such additional predictor network is required for each output unit of T_l). The goal of S_i^l is to emit the conditioned expectation of y_i^{p,l} given {y_k^{p,l}, k ≠ i}. This goal is achieved by simply training S_i^l to predict y_i^{p,l} from {y_k^{p,l}, k ≠ i} (see Fig. 1). To encourage even distributions in output space, we slightly modify D_l by introducing a term similar to the one in equation 2.1 and obtain

    D_l = −(1/2) Σ_i (s_i^{p,l} − y_i^{p,l})² + λ Σ_i (ȳ_i^l)²   (2.5)

3 Relation to Previous Work
Becker and Hinton (1989) solve symmetric problems (like the one of example 2, see Section 1) by maximizing the mutual information between the outputs of T1 and T2 (IMAX). This corresponds to the notion of finding mutually predictable yet informative input transformations. One variation of the IMAX approach assumes that T1 and T2 have single binary probabilistic output units. In another variation, T1 and T2 have single real-valued output units. The latter case, however, requires certain (not always realistic) gaussian assumptions about the input and output signals (see also Section 2.3 on Infomax). In the case of vector-valued output representations, Zemel and Hinton (1991) again make simplifying gaussian assumptions and maximize functions of the determinant of the q × q covariance matrices (DETMAX) of the output activations (Shannon 1948) (see Section 2.3). DETMAX

²Steve Nowlan has described an alternative nonpredictor-based approach for finding nonredundant codes (Nowlan 1988).
can remove only linear redundancy among the output units. (It should be mentioned, however, that with Zemel's and Hinton's approach the outputs may be nonlinear functions of the inputs.) The nice thing about IMAX is that it expresses the goal of finding mutually predictable yet informative input transformations in a principled way (in terms of a single objective function). In contrast, our approach involves two separate objective functions that have to be combined using a relative weight factor. An interesting feature of our approach is that it conceptually separates two issues: (1) the desire for discriminating mappings from input to representation, and (2) the desire for mutually predictable representations. There are many different approaches (with mutual advantages and disadvantages) for satisfying (1). In the context of a given problem, the most appropriate alternative approach can be "plugged into" our basic architecture. Another difference between IMAX and our approach is that our approach enforces not only mutual predictability but also equality of y^{p,1} and y^{p,2}. This does not affect the generality of the approach. Note that one could introduce additional "predictor networks," one for learning to predict y^{p,2} from y^{p,1} and another one for learning to predict y^{p,1} from y^{p,2}. Then one could design error functions enforcing mutual predictability (instead of using the essentially equivalent error function M used in this paper). However, this would not increase the power of the approach but would only introduce unnecessary additional complexity. In fact, one advantage of our simple approach is that it makes it trivial to decide whether the outputs of both networks essentially represent the same thing. The following section includes an experiment that compares IMAX to our approach.

4 Illustrative Experiments
The following experiments were conducted using an online backpropagation method with constant step size. In each experiment, positive training examples were randomly drawn from the set of legal pairs of input patterns. Details can be found in Schmidhuber and Prelinger (1992).

4.1 Finding Predictable Local Class Representations. This experiment was motivated by example 1 (see Section 1). At a given time, the "next" symbol emitted by a very simple stochastic "language generator" was not precisely predictable from the "previous" symbol but belonged to a certain class defined by the previous symbol. During training, at a given time T1 saw the previous symbol while T2 saw the next symbol. T1 minimized equation 1.3; T2 minimized equation 1.2 with D2 defined according to equations 2.1 and 2.2. Ten test runs with 15,000 training iterations were conducted. T2 always learned to emit different localized representations in response to members of predictable classes, while superfluous output units remained switched off.

4.2 Stereo Task. The binary stereo experiment described in Becker and Hinton (1989) (see also example 2 in Section 1) served to compare IMAX to our approach. Becker and Hinton report that their system (based on binary probabilistic units) was able to extract the "shift" between two simple stereoscopic binary images only if IMAX was applied in successive "layer by layer" bootstrap stages. In addition, they heuristically tuned the learning rate during learning. Finally, they introduced a maximal weight change for each weight during gradient ascent. In contrast, the method described herein (based on continuous-valued units) does not rely on successive bootstrap stages or any other heuristic considerations. We minimized equation 1.4 with D_l defined by predictability minimization according to equation 2.5. In a first experiment, we employed a different set of weights for each network. In 10 test runs involving 100,000 training patterns, the networks always learned to extract the stereoscopic shift. This performance of our nonbootstrapped system is comparable to the performance of Becker and Hinton's bootstrapped system. In a second experiment, we used only one set of weights for both networks (this leads to a reduction of free parameters). The result was a significant decrease of learning time: in 10 test runs the system needed between 20,000 and 50,000 training patterns to learn to extract the shift.
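The stereo-task inputs of example 2 can be sketched as follows, after Becker and Hinton (1989): the right strip is random binary, and the left strip is the right strip displaced by a single global shift. The strip length, the shift range, and the use of a cyclic (rather than truncating) shift are illustrative assumptions.

```python
import random

def make_stereo_pair(n=8, max_shift=2, rng=random):
    shift = rng.randint(-max_shift, max_shift)   # one global "depth" per pair
    right = [rng.randint(0, 1) for _ in range(n)]  # purely random right strip
    left = [right[(i - shift) % n] for i in range(n)]  # shifted copy
    # an input pattern concatenates a left strip with the right strip
    return left + right, shift

pattern, shift = make_stereo_pair()
```

The shift is the only property shared by the two halves of a pattern, which is exactly what the networks are supposed to discover.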
4.3 Finding Predictable Distributed Representations. Two properties of some binary input vector are the truth values of the following expressions:
1. There are more "ones" on the "right" side of the input vector than on the "left" side.
2. The input vector consists of more "ones" than "zeros."
During one learning cycle, a randomly chosen legal input vector was presented to T1; another input vector, randomly chosen among those with the feature combination of the first one, was presented to T2. T1 and T2 were constrained to have the same weights. Input vectors with equal numbers of ones and zeros, as well as input vectors with equal numbers of ones on both sides, were excluded. We minimized equation 1.4 with D_l defined by an autoencoder (equation 2.3). Ten test runs involving 15,000 pattern presentations were conducted. The system always came up with a distributed near-binary representation of the possible feature combinations.
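The pair generation for this experiment can be sketched as follows. This is a minimal sketch; the vector length and the helper names (features, legal, sample_pair) are illustrative assumptions, and ties are excluded as in the text.

```python
import random

def features(v):
    """The two binary features: more ones on the right half; more ones than zeros."""
    n = len(v) // 2
    left, right = sum(v[:n]), sum(v[n:])
    ones = left + right
    return (right > left, ones > len(v) - ones)

def legal(v):
    """Exclude ties on either feature, as in the text."""
    n = len(v) // 2
    return sum(v[:n]) != sum(v[n:]) and 2 * sum(v) != len(v)

def sample_pair(n=8, rng=random):
    """Draw a legal vector for T1, then another legal vector with the
    same feature combination for T2."""
    while True:
        v1 = [rng.randint(0, 1) for _ in range(n)]
        if legal(v1):
            break
    while True:
        v2 = [rng.randint(0, 1) for _ in range(n)]
        if legal(v2) and features(v2) == features(v1):
            return v1, v2
```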
With D_l defined by modified predictability minimization (equation 2.5), with simultaneous training of both predictors and classifiers, 10 test runs involving 10,000 pattern presentations were conducted. Again, the system always learned to extract the two features.

5 Conclusion

In contrast to the principled approach embodied by IMAX, our methods (1) tend to be simpler (e.g., they do not require sequential layer by layer "bootstrapping" or learning rate adjustments; the stereo task can be solved more readily by our system), (2) do not require gaussian assumptions about the input or output signals, (3) do not require something like DETMAX, and (4) partly have (unlike DETMAX) a potential for creating classifications with statistically independent components (this holds for D_l defined according to Section 2.4). In addition, our approach makes it easier to decide whether the outputs of both networks essentially represent the same thing. The experiments above show that the alternative methods of Section 2 can be useful for implementing the D_l terms in equation 1.4 to obtain predictable informative input transformations. More experiments are needed, however, to become clear about their mutual advantages and disadvantages. It also remains to be seen how well the methods of this paper scale to larger problems.
Acknowledgments

We thank Mike Mozer for fruitful discussions and Mike Mozer, Sue Becker, Rich Zemel, and an unknown referee for helpful comments on drafts of this paper. This research was supported in part by a DFG fellowship to J. Schmidhuber, as well as by NSF Award IN-9058450 and Grant 90-21 from the James S. McDonnell Foundation.
References

Barlow, H. B., Kaushal, T. P., and Mitchison, G. J. 1989. Finding minimum entropy codes. Neural Comp. 1(3), 412-423.
Becker, S., and Hinton, G. E. 1989. Spatial coherence as an internal teacher for a neural network. Tech. Rep. CRG-TR-89-7, Department of Computer Science, University of Toronto, Ontario.
Bridle, J. S., and MacKay, D. J. C. 1992. Unsupervised classifiers, mutual information and 'phantom' targets. In Advances in Neural Information Processing Systems 4, D. S. Lippman, J. E. Moody, and D. S. Touretzky, eds., pp. 1096-1101. Morgan Kaufmann, San Mateo, CA.
LeCun, Y. 1985. Une procédure d'apprentissage pour réseau à seuil asymétrique. Proceedings of Cognitiva 85, Paris, 599-604.
Linsker, R. 1988. Self-organization in a perceptual network. IEEE Computer 21, 105-117.
Nowlan, S. J. 1988. Auto-encoding with entropy constraints. In Proceedings of INNS First Annual Meeting, Boston, MA. Also published in special supplement to Neural Networks.
Parker, D. B. 1985. Learning-logic. Tech. Rep. TR-47, Center for Computational Research in Economics and Management Science, MIT.
Prelinger, D. 1992. Diploma thesis. Institut für Informatik, Technische Universität München.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, Vol. 1, pp. 318-362. MIT Press, Cambridge, MA.
Schmidhuber, J. H. 1992. Learning factorial codes by predictability minimization. Neural Comp. 4(6), 863-879.
Schmidhuber, J. H., and Prelinger, D. 1992. Discovering predictable classifications. Tech. Rep. CU-CS-626-92, Department of Computer Science, University of Colorado at Boulder.
Shannon, C. E. 1948. A mathematical theory of communication (parts I and II). Bell System Tech. J. 27, 379-423.
Werbos, P. J. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University.
Zemel, R. S., and Hinton, G. E. 1991. Discovering viewpoint-invariant relationships that characterize objects. In Advances in Neural Information Processing Systems 3, D. S. Lippman, J. E. Moody, and D. S. Touretzky, eds., pp. 299-305. Morgan Kaufmann, San Mateo, CA.

Received 2 June 1992; accepted 5 January 1993.
Communicated by Thomas Bartol
A Kinetic Model of Short- and Long-Term Potentiation

M. Migliore
Institute for Interdisciplinary Applications of Physics, National Research Council, Via Archirafi 36, I-90123 Palermo, Italy
G. F. Ayala
Department of Psychophysiology, University of Palermo, Via Pascoli 8, I-90143 Palermo, Italy
We present a kinetic model that can account for several experimental findings on short- and long-term potentiation (STP and LTP) and their pharmacological modulation. The model, which is consistent with Hebb's postulate, uses the hypothesis that part of the origin of LTP may be a consequence of an increased release of neurotransmitter due to a retrograde signal. The operation of the model is expressed by a set of irreversible reactions, each of which should be thought of as equivalent to a set of more complex reactions. We show that a retrograde signal alone is not sufficient to maintain LTP unless long-term change of the rate constant of some of the reactions is caused by high-frequency stimulation. Pharmacological manipulation of LTP is interpreted as modifications of the rate constants of one or more of the reactions that express a given mechanism. The model, because of its simplicity, can be useful to test more specific mechanisms by expanding one or more reactions as suggested by new experimental evidence.

1 Introduction

STP (Colino et al. 1992) and LTP (Bliss and Lomo 1973; Bliss and Gardner-Medwin 1973) are the short- or long-lasting increases of synaptic coupling that follow a train of conditioning stimuli. These phenomena are highly reproducible; nevertheless, the molecular mechanisms involved in their induction and maintenance are still unclear (Edwards 1991). They are perhaps the most elementary step toward higher brain functions such as memory, learning, associative recall, and the process of cognition. At this time, only a key role of calcium is well accepted and experimentally demonstrated.
Although considerable efforts have been made (Byrne and Berry 1988; Koch and Segev 1989), using a variety of computer models (e.g., Gamble and Koch 1987; Segev and Rall 1988; Holmes and Levy 1990), simulations have failed so far to reproduce their central feature, that is, the short- or long-lasting enhancement of the excitatory

Neural Computation 5, 636-647 (1993) © 1993 Massachusetts Institute of Technology
Figure 1: Schematic representation of the model. Note that the equations derived from this model are a set of simultaneous differential equations. Each step should be thought of as a representation of a more complex set of events. The two pathways of the presynaptic signal for the production of V are independent in this model.

postsynaptic potentials (EPSPs) obtained after high-frequency stimulation (HFS) of the afferent pathway. The model presented in this paper mimics this experimental observation. It is based on the growing evidence of a determining role of a retrograde signal (Bredt and Snyder 1992) for the induction of the potentiation, and indicates further that, in order to maintain the potentiation, the rate constant of some reactions should change during the HFS.

2 The Model
Our model (Fig. 1) represents a simplified view of the events of the synaptic transmission, where I is an intensive independent variable that represents the input stimulus, and V, C, and K are, respectively, the level of released neurotransmitter molecules, the postsynaptic signal, and the retrograde signal. The operation of the model is expressed by a set of irreversible events, symbolically represented by the following reactions (and corresponding simultaneous differential equations), each of which should be thought of as equivalent to a set of more complex reactions:

    (a) I --α--> V
    (b) V --β--> C
    (c) C --γ--> K
    (d) I + K --δ--> V
    (e) C --ε-->
The release of neurotransmitter into the synaptic cleft and the production of an output signal are represented, respectively, by reactions (a) and
(b). Reaction (e) represents the degradation of C, which could be thought of as the EPSP. Processes represented in these reactions are the classic synaptic events having ample experimental support. Reactions (c) and (d), instead, model (in a way that shall prove productive) the current controversial view (Edwards 1991) that the origin of LTP is due to an increased release of neurotransmitter secondary to a retrograde signal [e.g., nitric oxide (Bredt and Snyder 1992)] acting in the feedback loop. In these reactions the variable K is the level of this retrograde signal or plasticity factor (Bliss and Lynch 1988). The system of differential equations derived from this model can be solved analytically. Linear stability analysis (Nicolis and Prigogine 1977) shows that the system is always stable. The only stationary solution that it admits, for I > 0, is

    V_s = α(γ + ε)I/(βε),   C_s = αI/ε,   K_s = αγ/(δε)   (1)
Let us now consider the behavior of the system as a consequence of a short pulse of I. While the pulse lasts, the values of V, C, and K will tend to those in equations 1, reaching final values VO, CO,and &, which will depend on the pulse length. From the end of the pulse on, the behavior of the system is described by
v
= Voe-flr
(2.1)
e-Pt)
(2.2)
An interesting consequence of equation 2.2 is that for C_0 = 0 it reduces to the widely used phenomenological formula describing the time course of EPSPs in terms of empirical times of onset, τ_O, and decay, τ_D, that is, V_EPSP = const [τ_D/(τ_D − τ_O)] (e^{−t/τ_D} − e^{−t/τ_O}). The phenomenological formula is thus derived from our model with τ_O = 1/β and τ_D = 1/(γ + ε). Integration of the system of differential equations derived from the model has been carried out with a fourth-order Runge-Kutta method (Press et al. 1987) with a fixed time step of 1.0 × 10⁻⁴. It should be stressed that the model is too simplified to justify a quantitative comparison with specific biophysical mechanisms that, at the time being, are thought to have a major role in the induction and maintenance of LTP. At this stage, for example, γ could represent all those mechanisms that may result in the production of a retrograde signal (e.g., from calcium buffering by calmodulin to NO synthesis by nitric oxide synthase), β the kinetics of activation of ionic currents, and ε might include leakage, kinetics of
inactivation of ionic currents and Ca²⁺ pumping, diffusion, and buffering. As we will see, to be qualitatively consistent with experiments, it is not necessary to specify in more detail any of these processes, at least within the limits of the scope of this paper.

3 Results and Discussion
3.1 Simulations. Although the values of the rate constants used in our simulations were set arbitrarily, they were chosen to give a relationship among them that is physiologically reasonable and to be convenient from a computational point of view. In fact, ε, the rate constant of the "postsynaptic signal degradation," is the smallest of all the rate constants, in order to use most of the postsynaptic signal "for LTP purposes"; β > (γ + ε), that is, the rate constant of "channel activation" for the production of C is larger than the sum of the rate constants of the "production of the retrograde signal" K and of the "postsynaptic signal degradation," so that C is always > 0; δ, the rate constant for "the production of neurotransmitter, V, by the retrograde signal," has a value that yields an amplitude of STP roughly comparable with experimental observations; and γ is such that the 1/(γ + ε) time constant is not too long, to avoid unnecessarily long simulation times. It turns out that all values are within one order of magnitude, and different values have been tried with no essential differences in the results. The actual model parameters are reported in the legend of Figure 2. The time course of the quantities shown in Figure 2 [as simulated according to the equations derived from reactions (a)-(e)] starts from equilibrium with I = 0.01, V = V_0 = V_s, C = C_0 = C_s, and K = K_0 = K_s. If there is enough time between stimuli for all the variables to return to equilibrium, each new stimulus will find the system in unperturbed conditions. This, however, will not be true in the HFS case, that is, when the interstimulus interval is shorter than the time constants of EPSP decay in equations 2. In such a case, and since β > (γ + ε), the average values of V and C will increase. As shown in Figure 2, at the end of HFS, the released neurotransmitter, V, and the postsynaptic signal, C, will return to the equilibrium values, but K will be K_HFS > K_0.
Thus, the net result of a HFS is to increase the final value of K to K_HFS, which will be the new initial, nonequilibrium condition for the following stimuli. The larger K produces, at each pulse, Vs larger than the ones before HFS, resulting in higher C peaks and initial slopes. This phase can be considered as the equivalent of the STP in the experiments (Colino et al. 1992), and its time scale is essentially dependent on the value of the current in the presynaptic cell [i.e., the independent variable I in reaction (d)]. This prediction of the model might be experimentally tested using different presynaptic holding potentials. It should be noted that still another form of short-lasting potentiation has been
Figure 2: Time course of the C, V, and K signals from simulation with low- and high-frequency (HFS) stimuli. Amplitudes and times are given in arbitrary units. Values for the rate constants are α = 2, β = 5, γ = 1, δ = 10, ε = 0.5. Pulses of I, not shown, are of amplitude 1 (arbitrary units) and duration 50 time steps. The interstimulus interval is 4950 and 50 time steps for low and high frequency, respectively. The rising time of K after the HFS, as well as after each I pulse, is mainly determined by the 1/(γ + ε) time constant (see equation 2.3). The duration of the STP phase also depends on 1/(γ + ε), but it is mainly determined by δ, as a consequence of reaction (d), and by the background current I (in this case I = 0.01).
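The integration described in the text can be sketched as follows. This is a minimal sketch, not the paper's code: the differential equations are our reading of reactions (a)-(e), the rate constants are those of the Figure 2 legend, and the step size, run length, and constant background current (no pulses) are illustrative simplifications.

```python
# Our reading of the kinetics implied by reactions (a)-(e):
#   dV/dt = alpha*I + delta*I*K - beta*V
#   dC/dt = beta*V - (gamma + eps)*C
#   dK/dt = gamma*C - delta*I*K
alpha, beta, gamma, delta, eps = 2.0, 5.0, 1.0, 10.0, 0.5

def deriv(state, I):
    V, C, K = state
    return (alpha * I + delta * I * K - beta * V,
            beta * V - (gamma + eps) * C,
            gamma * C - delta * I * K)

def rk4_step(state, I, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1 = deriv(state, I)
    k2 = deriv([s + 0.5 * h * k for s, k in zip(state, k1)], I)
    k3 = deriv([s + 0.5 * h * k for s, k in zip(state, k2)], I)
    k4 = deriv([s + h * k for s, k in zip(state, k3)], I)
    return [s + (h / 6.0) * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

# Relax to equilibrium under the background current I = 0.01; the state
# should approach the stationary solution of equations 1:
# V_s = alpha*(gamma+eps)*I/(beta*eps), C_s = alpha*I/eps, K_s = alpha*gamma/(delta*eps)
state, I, h = [0.0, 0.0, 0.0], 0.01, 0.01
for _ in range(100000):
    state = rk4_step(state, I, h)
V, C, K = state
```

With these constants the stationary values are V_s = 0.012, C_s = 0.04, K_s = 0.4, which the integration reproduces; pulsed input and HFS protocols would be layered on top by making I time dependent.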
Figure 3: Left: Time course of C, V, and K from a simulation with reaction (d) changed to K --δ--> V and K_0 = 0. Right: Time course of C (solid) and V (dashed) from a simulation with γ = 0 and K_0 = 0. Other rate constants, parameters, and protocol of stimulation as in Figure 2.

described, the posttetanic potentiation (PTP), and it is usually assumed that it involves an entirely presynaptic mechanism. Our model does not take into account this kind of potentiation because we are assuming that the mechanisms involved in STP and LTP are different from those of PTP, namely the role of the retrograde signal. The role of the interaction between I and K in inducing STP and LTP is also evident from the model. The schematic picture in Figure 1 shows, in its presynaptic portion, two separate pathways for I to produce the neurotransmitter V: the first, directly, with a rate constant α, and the second, interacting with the retrograde signal K, with a rate constant δ. Let us for a moment suppose that the interaction of I with the retrograde signal, K, is impaired, but the first pathway is left intact. In the simulations this is accomplished by suppressing I in reaction (d), changing it to K --δ--> V. In such conditions, Figure 3 (left) shows that, using the same protocol of stimulation as before, the synaptic transmission is not impaired, and yet there is no STP. This example illustrates the role of I in the interaction. The role of K is seen from Figure 3 (right), obtained for γ = 0, that is, K = 0. The requirement for the simultaneous presence of both I and K is of course consistent with Hebb's postulate (Hebb 1949). This is because the postsynaptic depolarization produces the retrograde signal,
M. Migliore and G. F. Ayala
K, capable of enhancing the production of V in response to the presynaptic signal. The simulation reported in Figure 4a, where all reaction rate constants are kept fixed, is conveniently compared with the experimental data of Figure 4b. The comparison shows that the peak amplitudes of C (see also Figure 2), in response to stimuli that are qualitatively similar to the standard experimental protocol used to obtain LTP, are not similar to the typical experimental results on LTP but to those on STP. We tested several different simple models, but all of them consistently failed to show LTP. In fact, the system is always stable and admits only one stable solution. Since the basic idea is to use the retrograde signal, to obtain LTP we can (1) consider alternative and more complex kinetic pathways (e.g., autocatalytic processes) using the retrograde signal, or (2) assume that HFS triggers the change of some rate constant. Both approaches make explicit predictions and can be used as useful tools to help the interpretation of experiments. Since our major objective in this paper is to keep the model as simple as possible, so that it is mathematically tractable and easily modifiable to include more detailed specific processes, we follow (2). From equations 2.1 and 2.2 it follows that the peak value and the initial slope of the output signal C, after HFS, are determined by the equilibrium value of the retrograde signal, K∞, and by the rate at which the neurotransmitter produces an output signal, β. Thus an increase of β, or any change in some other rate constant that increases K∞, will give LTP, as shown, for example, in Figure 4c-d, obtained by mere alterations of the γ and ε rate constants (an increase and a decrease, respectively, to keep τD = 1/(γ + ε) constant).
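The mechanism just described can be illustrated with a toy integration. Reactions (a)-(e) are not reproduced in this section, so the ODE system below is only one plausible reading of the scheme as summarized here (I feeds V directly with rate α and through K with rate δ; V drives C with rate β; C both produces K with rate γ and decays with rate ε; a decay rate κ for K is an extra assumption, as are all numerical values). It shows the claimed effect of the Figure 4c manipulation: raising γ and lowering ε while keeping γ + ε (and hence τD) fixed raises the equilibrium retrograde signal, and with it the peak response C to a test pulse.

```python
def simulate(T=200.0, dt=0.01, alpha=2.0, beta=5.0, gamma=1.0,
             delta=10.0, eps=0.5, kappa=0.2, I_bg=0.01,
             pulse=(150.0, 150.5)):
    """Euler-integrate a toy reading of the V, C, K kinetics.
    A constant background current I_bg flows throughout; a single
    test pulse of amplitude 1 is applied during `pulse`.
    kappa (decay of K) is an assumption not stated in the text."""
    V = C = K = 0.0
    peak_C = 0.0
    for step in range(int(T / dt)):
        t = step * dt
        I = I_bg + (1.0 if pulse[0] <= t < pulse[1] else 0.0)
        dV = alpha * I + delta * I * K - beta * V   # two production pathways for V
        dC = beta * V - (gamma + eps) * C           # C decays with rate gamma + eps
        dK = gamma * C - kappa * K                  # retrograde signal from C
        V += dV * dt
        C += dC * dt
        K += dK * dt
        if t >= pulse[0]:
            peak_C = max(peak_C, C)
    return peak_C, K

# 'Pre-LTP' constants versus the Figure 4c change (gamma up, eps down,
# keeping gamma + eps, and hence tau_D, fixed):
peak_stp, _ = simulate(gamma=1.0, eps=0.5)
peak_ltp, _ = simulate(gamma=1.2, eps=0.3)
```

The comparison confirms that, in this sketch, the rate-constant change alone is enough to raise the response to an unchanged stimulus, which is the model's expression of LTP.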
These changes appear physiologically reasonable and expected if one considers that they correspond to a "more efficient" production of the retrograde signal, γ, and a change in the processes involved in the degradation of the postsynaptic signal, ε. More important, however, these changes are the only ones consistent with the experimental fact (Gustafsson and Wigstrom 1990) that the time evolution of potentiated EPSPs, after normalization, does not change with LTP. It should also be stressed that these changes are to be considered the net results of changes to the kinetic parameters of the specific reactions underlying our simple model. The agreement with actual LTP experiments is clearly evident from Figure 4c and d. The transitory increase of C (Fig. 2) during HFS has already been simulated and explained in terms of several reactions that involve calcium pumping, diffusion, and buffering in the spine head and neck (Gamble and Koch 1987; Holmes and Levy 1990), as well as the electrotonic characteristics of the membrane. The simple hypothesis at the basis of our model is that the increase of the postsynaptic signal C produces a large retrograde signal that triggers the induction of LTP. Its maintenance is obtained by a change in the rate constant involved in the production of the retrograde signal, and we propose that this change requires a short-term increase in the concentration of the retrograde signal itself.

Figure 4: Peak amplitudes of C (a) compared with typical experimental findings on STP (b). The arrow on the time axis indicates the time of application of HFS. Parameters and protocol of stimulation as in Figure 2. Peak amplitudes of C (c), from a simulation where at the end of HFS (arrow) the γ rate constant is increased to 1.2 and ε decreased to 0.3, compared with typical experimental findings on LTP (d). Other rate constants and protocol of stimulation as in Figure 2. (Experimental data taken and redrawn from Colino et al. 1992.)
3.2 Comparison with Experimental Data. The model can account for several experimental observations on the mechanisms involved in LTP, as shown by work currently in progress at our laboratory. For example, the transient depression followed by short-term potentiation after N-methyl-D-aspartate (NMDA) application (Kauer et al. 1988) can be reproduced by assuming an initial reduction of δ and a transitory increase of γ and β. The presence of extracellular hemoglobin, or of inhibitors of NO production such as nitro-L-arginine methyl ester (L-NAME), has been shown (Haley et al. 1992) to prevent the maintenance of LTP. In terms of our model, this observation suggests that the triggering cause required to produce LTP is not, or not only, the high postsynaptic calcium concentration, but the increase of the retrograde signal. Any pharmacological manipulation that interferes with this transient increase during HFS may prevent LTP. However, at this point the model cannot account for those experiments where the application of L-NAME after LTP has been established fails to inhibit the maintenance of the potentiation. The model also easily accounts for the finding that LTP cannot be induced by an HFS without a conjunctive depolarization of the postsynaptic cell, for example, when the postsynaptic cell is voltage clamped (Kelso et al. 1986). In fact, a voltage clamp corresponds, in our model, to forcing C to a low value during HFS. This prevents the increase in the level of the retrograde signal K during HFS, and thus inhibits the changes of the rate constants that we assume to be a consequence of the high K. It has been shown that the calcium chelators ethylene-bis(oxyethylenenitrilo)tetra-acetic acid (EGTA) (Lynch et al. 1983) and Nitr-5 (Malenka et al. 1988) prevent the Ca2+ influx through NMDA channels from contributing to the EPSP potentiation.
Since it is known that ionic flux through NMDA channels mediates the late component of an EPSP, we can simulate the effect of the chelators with an increase of the rate constant that represents all the processes involved in EPSP decay, that is, ε. The model indicates that, because of the increase of ε, the depolarization during an HFS is not sufficiently large to increase the retrograde signal to the level necessary to obtain LTP. Finally, two different types of potentiation have been reported in recent experiments (Gustafsson and Wigstrom 1990). They differ in the changes in the time course of the recorded EPSPs. In one case, only the peak value increases, without changes in the time evolution characteristics. In the other case, the potentiation is expressed as a prolongation of the onset time and a larger peak amplitude. In terms of equation 2.2, both findings can be explained as changes of γ and ε such as to keep τD = 1/(γ + ε) constant or not, respectively.
4 Conclusions
Each of the reactions (a)-(e) should be considered (as already remarked) as representative of an equivalent set of more complex reactions. The purpose of the present work is to establish a simplified framework, qualitatively consistent with the available experimental data, within which one can expand one or more reactions in order to test a specific mechanism suggested by experimental evidence. From this point of view, our model does not take explicitly into account, for example, the kinetics of ionic fluxes through NMDA or other channels, or calcium-dependent protein kinases (Kitajima and Hara 1990). The detailed modeling of these or other processes is, of course, needed to define more precisely the still unknown biophysical mechanisms involved in LTP and to obtain quantitative agreement with experimental data. On the backbone of our set of reactions, any additional process can further modulate the induction or the expression of LTP. We believe that our model can be useful to stimulate discussion and experimentation in the field. The present model uses the hypothesis that part of the origin of LTP may lie in the (presynaptic) increased release of neurotransmitter provoked by a retrograde signal produced by a postsynaptic mechanism (depolarization and/or high calcium concentration). The simulations, using the model presented, further support this hypothesis because the results obtained are (1) in very good qualitative agreement with experimental data on STP and LTP, (2) consistent with Hebb's postulate, (3) consistent with the expected time course for EPSPs in terms of molecular rate constants rather than empirically, and (4) consistent with and capable of reproducing the effects of several LTP modulators. In this case the model can also predict the possible kinetics of the retrograde signal itself (Fig. 2). Moreover, the model has shown (Fig. 4) that although the presence of a retrograde signal is enough to induce a potentiation very similar to STP, it is not sufficient to induce the maintenance of LTP; other postsynaptic events at the molecular level (such as the change of the rate constant of its production) are also necessary.
Acknowledgments We thank Prof. M. U. Palma for a critical reading of the manuscript and for valuable suggestions, Prof. S. L. Fornili and Prof. A. Messina for useful discussions, and Mr. S. Pappalardo for technical assistance. This work was carried out at IAIF-CNR and was supported also by CRRN-SM local funds.
References
Bliss, T. V. P., and Gardner-Medwin, A. R. 1973. Long-lasting potentiation of synaptic transmission in the dentate area of the unanaesthetized rabbit following stimulation of the perforant path. J. Physiol. 232, 357-374.
Bliss, T. V. P., and Lomo, T. 1973. Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. J. Physiol. 232, 331-356.
Bliss, T. V. P., and Lynch, M. A. 1988. Long-term potentiation of synaptic transmission in the hippocampus: Properties and mechanisms. In Long-Term Potentiation: From Biophysics to Behavior, P. W. Landfield and S. A. Deadwyler, eds., pp. 3-72. Alan R. Liss, New York.
Bredt, D. S., and Snyder, S. H. 1992. Nitric oxide, a novel neuronal messenger. Neuron 8, 3-11.
Byrne, J. H., and Berry, W. O., eds. 1988. Neural Models of Plasticity: Experimental and Theoretical Approaches. Academic Press, New York.
Colino, A., Huang, Y.-Y., and Malenka, R. C. 1992. Characterization of the integration time for the stabilization of long-term potentiation in area CA1 of the hippocampus. J. Neurosci. 12, 180-187.
Edwards, E. 1991. LTP is a long term problem. Nature (London) 350, 271.
Gamble, E., and Koch, C. 1987. The dynamics of free calcium in dendritic spines in response to repetitive synaptic input. Science 236, 1311-1315.
Gustafsson, B., and Wigstrom, H. 1990. Basic features of long-term potentiation in the hippocampus. Semin. Neurosci. 2, 321-333.
Haley, J. E., Wilcox, G. L., and Chapman, P. F. 1992. The role of nitric oxide in hippocampal long-term potentiation. Neuron 8, 211-216.
Hebb, D. O. 1949. The Organization of Behavior. Wiley, New York.
Holmes, W. R., and Levy, W. B. 1990. Insights into associative long-term potentiation from computational models of NMDA receptor-mediated calcium influx and intracellular calcium concentration changes. J. Neurophysiol. 63, 1148-1168.
Kauer, J. A., Malenka, R. C., and Nicoll, R. A. 1988.
NMDA application potentiates synaptic transmission in the hippocampus. Nature (London) 334, 250-252.
Kelso, S. R., Ganong, A. H., and Brown, T. H. 1986. Hebbian synapses in hippocampus. Proc. Natl. Acad. Sci. U.S.A. 83, 5326-5330.
Kitajima, T., and Hara, K. 1990. A model of the mechanisms of long-term potentiation in the hippocampus. Biol. Cybern. 64, 33-39.
Koch, C., and Segev, I., eds. 1989. Methods in Neuronal Modeling: From Synapses to Networks. MIT Press, Cambridge, MA.
Lynch, G., Larson, J., Kelso, S., Barrionuevo, G., and Schottler, F. 1983. Intracellular injections of EGTA block induction of hippocampal long-term potentiation. Nature (London) 305, 719-721.
Malenka, R. C., Kauer, J. A., Zucker, R. S., and Nicoll, R. A. 1988. Postsynaptic calcium is sufficient for potentiation of hippocampal synaptic transmission. Science 242, 81-84.
Nicolis, G., and Prigogine, I. 1977. Self-Organization in Nonequilibrium Systems: From Dissipative Structures to Order through Fluctuations. Wiley, New York.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1987. Numerical Recipes: The Art of Scientific Computing. Cambridge Univ. Press, Cambridge.
Segev, I., and Rall, W. 1988. Computational study of an excitable dendritic spine. J. Neurophysiol. 60, 499.
Received 5 June 1992; accepted 24 November 1992.
Communicated by Anthony Zador and Christof Koch
Artificial Dendritic Trees John G. Elias Department of Electrical Engineering, University of Delaware, Newark, DE 19716 USA
The electronic architecture and dynamic signal processing capabilities of an artificial dendritic tree that can be used to process and classify dynamic signals are described. The electrical circuit architecture is modeled after neurons that have spatially extensive dendritic trees. The artificial dendritic tree is a hybrid VLSI circuit and is sensitive to both temporal and spatial signal characteristics. It does not use the conventional neural network concept of weights, and as such it does not use multipliers, adders, look-up tables, microprocessors, or other complex computational units to process signals. The weights of conventional neural networks, which take the form of numerical, resistive, voltage, or current values but do not have any spatial or temporal content, are replaced with connections whose spatial locations have both temporal and scaling significance. 1 Introduction
Interest in using artificial neural networks for the identification and control of dynamic systems is growing (e.g., Narendra and Parthasarathy 1990). However, most neural network models do not include spatiotemporal dynamic signal processing capabilities. In these models, the neuron is treated as a point entity that receives and processes inputs at the soma, which makes spatial signal processing difficult or impossible. In our modeling approach, we have looked beyond the soma to the extensive dendritic tree structure of neurons, which not only forms most of the cell's surface area but also provides spatiotemporal signal processing capabilities not present in models that assume a point-entity neuron. The artificial dendritic tree described in this paper is a hybrid circuit and is sensitive to both temporal and spatial signal characteristics, but it does not use the conventional neural network concept of weights, and as such it does not require multipliers, adders, look-up tables, or other complex computational units to process signals. The weights of conventional neural networks, which take the form of numerical, resistive, voltage, or current values but do not have any spatial or temporal content, are replaced in our system with connections whose spatial locations have both temporal and scaling significance. Neural Computation 5, 648-664 (1993) © 1993 Massachusetts Institute of Technology
We have only recently begun to experiment with networks of artificial dendritic trees (Elias 1992a). We have fabricated and tested artificial dendritic branches in CMOS analog VLSI (Elias et al. 1992), and we have used a genetic algorithm to train simple networks to follow a maneuvering target that moves in one dimension (Elias 1992b). The research described here attempts to capture neurocomputational principles by applying structure and behavior modeled after the synaptic and dendritic levels of biological implementation. We hope to demonstrate that electronic analogs of biological computational devices that include the properties of spatially extensive dendritic trees and the impulse response of chemical synapses can form the basis for powerful artificial neurosystems. 2 Artificial Dendrite and Chemical Synapse
In this section, we describe electronic circuits that (1) emulate the electrical behavior of passive dendritic trees and chemical synapses and (2) are simple and robust enough to ensure that networks, which ultimately need to support huge numbers of synapses, can be constructed with standard VLSI processing. Electronic analogs of active dendrite behavior (e.g., Llinas and Sugimori 1980; Shepherd et al. 1985, 1989; Hounsgaard and Midtgaard 1988) will not be treated in this paper. 2.1 Artificial Dendrite. Passive artificial dendrites are formed by a series of standard compartments, where each compartment has a capacitor, Cm, that represents the membrane capacitance, a resistor, Rm, that represents the membrane resistance, and an axial resistor, Ra, that represents the cytoplasmic resistance (e.g., Rall 1989). Figure 1a shows a section of artificial dendrite with five standard compartments that is part of a much longer branch like that shown in Figure 1b. The transient response of the artificial dendrite is of primary importance. Figure 1c shows the impulse response measured at point S due to inward impulse current at four different locations, A, B, C, and D, on a passive artificial dendrite as represented in Figure 1b. The location S represents the position of the soma. Therefore, the voltages measured at S are those that would affect somatic voltage-sensitive circuits and perhaps cause the generation of an efferent impulse. As with biological passive dendrites, the peak voltage amplitude is largest for transmembrane current nearest the soma and becomes rapidly smaller for sites farther away. The time for the voltage to peak shows a similar behavior: time-to-peak voltage increases with distance from S (e.g., Rall 1989). The behavior shown in Figure 1 illustrates how the concept of weight is an inherent property of the dendritic physical structure.
It is clear that position along the artificial dendrite can be used to produce an effective weighting, in both time and amplitude, of afferent signals that are in the form of a transient inward or outward current.
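The position-dependent weighting described above is easy to reproduce with a discrete RC-chain integration. The sketch below is a generic compartmental cable in the spirit of the circuit (not the chip's actual component values; all names, units, and numbers are illustrative): injecting a brief current pulse farther from compartment 0 (the "soma") yields a smaller and later voltage peak there.

```python
import numpy as np

def soma_response(inject_at, n=20, Cm=1.0, Rm=10.0, Ra=1.0,
                  amp=1.0, width=0.1, T=40.0, dt=0.001):
    """Peak amplitude and time-to-peak at compartment 0 of a passive
    compartmental cable when a brief current pulse is injected at
    compartment `inject_at`.  All units are arbitrary."""
    V = np.zeros(n)
    trace = []
    for step in range(int(T / dt)):
        t = step * dt
        I = np.zeros(n)
        if t < width:
            I[inject_at] = amp           # transient transmembrane current
        # axial currents between neighbors, leak through Rm, drive by I
        ax = np.zeros(n)
        ax[:-1] += (V[1:] - V[:-1]) / Ra
        ax[1:] += (V[:-1] - V[1:]) / Ra
        V += dt * (ax - V / Rm + I) / Cm
        trace.append(V[0])
    trace = np.array(trace)
    return trace.max(), trace.argmax() * dt

peak_prox, t_prox = soma_response(2)     # proximal injection
peak_dist, t_dist = soma_response(8)     # distal injection
```

The proximal site gives the larger, earlier somatic peak, matching the qualitative behavior in Figure 1c.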
Figure 1: (a) Compartmental model of a passive dendrite. Each RC section, Rm, Ra, and Cm, is a standard compartment that simplifies VLSI layout. (b) Standard compartments are abutted on the substrate to form silicon dendritic branches. (c) Measured impulse response of a single artificial dendritic branch due to transient transmembrane current at the indicated locations on the branch. 2.2 Artificial Chemical Synapse. Inward or outward impulsive current at a specific artificial dendrite location is enabled by a single MOS field-effect transistor. A p-channel transistor enables inward current, which produces an excitatory-type response, and an n-channel transistor enables outward current, which results in an inhibitory-type response. The complete artificial dendrite circuit is shown in Figure 2a, where both p-channel (upper) and n-channel (lower) transistors are placed at uniform positions along the branch. The transistors are turned on by an impulse signal applied to their gate terminals. Both transistor types operate in the triode region (e.g., Allen and Holberg 1987). Therefore, the amount of transmembrane current depends on the conductance of the transistor in the on state, the duration of the gate terminal impulse signal, and the potential difference across the transistor, which depends on the state of the dendrite at the point of the synapse. All excitatory transistors have identical drawn dimensions (as do inhibitory transistors), and both excitatory and inhibitory artificial synapses are placed at the same locations in the current chip implementation. It is possible that after further system-level experimentation we may find that a different synapse distribution is preferred.
2.3 Electrical Response of Artificial Dendritic Trees and Synapses. In Figure 1, we illustrated the behavior of the impulse response amplitude as a function of synapse position on the artificial dendritic tree, thus demonstrating the effective weighting of inputs that are mapped onto the tree structure. The impulse response amplitude as a function of the afferent impulse signal width is shown in Figure 3a, which represents measured responses from one of our VLSI circuits for four different
Figure 2: (a) A five-compartment section of artificial dendrite with five excitatory and five inhibitory artificial synapses. Vrest is the resting voltage; Vtop is the maximum membrane voltage. (b) A multibranched artificial dendritic tree constructed by piecing together artificial dendrite sections like that in (a). impulse widths. A similar postsynaptic behavior is found in biological preparations under presynaptic voltage clamp: presynaptic depolarization produces a nearly linear increase in postsynaptic voltage (e.g., Angstadt and Calabrese 1991). This behavior may be due to a lengthening of the time over which transmitter is released, thereby increasing transmembrane current in the postsynaptic terminal. In any event, the efficacy of existing connections in our system can be changed by altering the impulse width. We are investigating how this may be done on a local basis, perhaps consistent with Hebb's postulate (Hebb 1949), such that both the local synaptic strength and the location of the synapse on the branch combine to produce an effective synaptic weight for a given connection. The artificial dendrite's voltage response to closely spaced impulses is shown in Figure 3b. The response due to each synaptic event is added to the resultant branch point voltage from past events until the voltage reaches a maximum value. This behavior is the expected impulse response of an Nth-order system and is solely due to the effective postsynaptic membrane. The same behavior would be observed if the phasing of multiple, transiently conducting artificial synapses was short compared to the effective membrane time constant. An example that utilizes this behavior is discussed in Section 4. Multiple, simultaneously conducting synapses that are electrically close together produce a voltage at the soma that is less than the sum of their individual responses (e.g., Shepherd and Koch 1990).
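The sublinear summation of electrically close synapses can be seen in a minimal steady-state sketch. Two conductance-based synapses driving the same isopotential node share one shunting load, whereas synapses on separate branches approximately do not; the single-node model below is a deliberate simplification, not the VLSI circuit, and all values are illustrative.

```python
def node_voltage(g_syn, g_leak=1.0, e_syn=1.0):
    """Steady-state voltage of a single isopotential node with total
    synaptic conductance g_syn to reversal e_syn and a leak to 0 V:
    voltage-divider solution of g_syn*(e_syn - V) = g_leak*V."""
    return g_syn * e_syn / (g_syn + g_leak)

g = 1.0
same_site = node_voltage(2 * g)      # two synapses sharing one site
separate = 2 * node_voltage(g)       # two electrically distant synapses,
                                     # responses summing nearly linearly
```

With these numbers, the shared-site response (2/3 of the reversal potential) falls short of the linear sum, which is the shunting effect measured as curve 1 versus curve 2 in Figure 3c.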
This sublinear effect is due to the shunting load seen by each synaptic site when electrically nearby synapses open their channels. In contrast, multiple, simultaneously conducting synapses that are electrically far apart produce a nearly linear resultant voltage at the soma. Both of these behaviors, as measured at the trunk of a two-branched artificial dendritic tree (e.g., point S in Fig. 2b), are shown in Figure 3c. The smaller voltage transient
Figure 3: Experimental results from the artificial dendrite-synapse circuit in response to afferent stimulation. (a) Graded response: the amplitude of the voltage peak at the soma is linearly related to the afferent impulse width over a wide range. (b) Tetanus response: closely spaced impulses cause the voltage response to saturate if the impulse rate is faster than the membrane decay time. (c) Nonlinear and nearly linear response: curve 1 is the resultant somatic voltage for simultaneous stimulation of two adjacent synapses on the same branch (see Fig. 2b). Curve 2 is the somatic voltage for simultaneous stimulation of two synapses on different branches. The synapses for curves 1 and 2 were equidistant from the soma.
(curve 1) was measured when two adjacent artificial synapses were simultaneously active on the distal end of one of the branches. The larger voltage transient (curve 2) shows the resultant voltage when two artificial synapses on separate branches were simultaneously active. In this case, the resultant is nearly twice that of the previous example. In both cases, the artificial synapses were equidistant from the point of measurement. This type of behavior clearly enriches the signal processing capabilities of systems composed of spatially extensive dendritic trees (Koch and Poggio 1987). 3 Silicon Dendritic Trees
If artificial dendrites are to be used in real systems, then they must be implemented via a process that can make huge numbers of them inexpensively in a small area. The only feasible path for doing this currently is with standard silicon processing methods (e.g., Mead 1989). In this section, we briefly discuss the implementation of a dendritic system in silicon. 3.1 Convergent, Divergent, and Recurrent Connections. Networks that are built with artificial dendrites and synapses process signals that have a spatiotemporal significance by mapping afferent signal pathways to specific locations on the dendritic trees. The connections between synapses and the outputs of sensors and neurons determine the overall system response for a given dendritic dynamic behavior. The number of
different connection patterns is quite large and is a factorial function of the number of synapses and sensor elements. If we limit, for the moment, the number of divergent connections to one, then the total number of different connection patterns is given by

N!/(N - M)!   (1)

where N is the number of artificial synapses and M is the number of sensor and neuron outputs. Artificial systems may have thousands of afferents and many times more synapses, resulting in an extremely large number of possible connection patterns. In our system, we allow each sensor element or artificial neuron to make unrestricted divergent connections and each synapse to receive multiple convergent connections from both sensor elements and artificial neurons. This tends to make the number of possible connection patterns much larger than that indicated by equation 1.
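For the one-divergent-connection case, equation 1 counts the injective assignments of the M outputs to the N synapses (ordered selections without repetition). A quick sanity check; the function name is our own:

```python
from math import factorial

def connection_patterns(n_synapses, m_outputs):
    """N!/(N - M)!: the number of ways to give each of M distinct
    outputs its own synapse when divergence is limited to one
    connection per output."""
    return factorial(n_synapses) // factorial(n_synapses - m_outputs)

small = connection_patterns(5, 2)    # 5 * 4 = 20 patterns
```

Even modest sizes explode: with N = 1000 synapses and M = 100 afferents the count already has well over two hundred decimal digits, which is the point being made in the text.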
3.2 Virtual Wires. In the implementation of an electronic system, the number of data pathways into or out of modules is limited by the available technology. Integrated circuit packages rarely exceed 500 pins; our current artificial dendrite chip is in a 40-pin package. This limitation in pin count is of special concern with dynamic artificial neuronal systems because of the analog nature of the computation. Each sensor or neuron output must be able to connect to any one of the artificial synapses in the system, and the spiking outputs from sensors and neurons must arrive at their artificial synapse destinations in a parallel fashion. In order to overcome I/O limitations and to meet connectivity and timing requirements, we make use of a multiplexing scheme that we refer to as virtual wires. In this scheme, the outputs of active neurons and sensors (i.e., those that are currently producing a spike or impulse) cause the synapses that they connect with to become activated after a delay that is programmable for each efferent connection. The process of reading an active output causes that output to return to the inactive (i.e., nonspiking) state. After all sensor and neuron output states have been sampled, the activated synapses throughout the system are turned on transiently by a global impulse stimulus signal. The process of determining active sensors and artificial neurons, delayed activation of synapses that connect to active sensors and neurons, and transiently turning on active artificial synapses continues indefinitely with a period that depends on system dynamics. Virtual wires are formed using four digital circuits: Stimulus Memory, which is closely associated with each synapse; Address Decoding, which serves all on-chip synapses; the State Machine, which determines sensor and neuron output states; and the Connection List, which specifies the locations of synapses and the axonal delay associated with each connection. Stimulus Memory and Address Decoding are on-chip circuits; the Connection List and State Machine are off-chip. The circuit diagram of an excitatory Stimulus Memory register connected to its p-channel artificial synapse transistor is shown in Figure 4.

Figure 4: Circuit diagram of one of the excitatory Stimulus Memory registers shown with its p-channel excitatory synapse transistor. When SET* is asserted, the synapse is activated (i.e., excsyn*[n] is set to logic 0) and the artificial synapse will turn on when STIMULATE* is asserted. CLEAR* inactivates all Stimulus Memory locations throughout the system. Both STIMULATE* and CLEAR* are global signals in the system.

In the current implementation, virtual wires add nine transistors to each artificial synapse. A synapse is activated when its excsyn*[n] is set to logic 0 by asserting SET* while CLEAR* is unasserted. The SET* signal is asserted by the on-chip address decoder when the proper combination of external address lines and control signal is asserted. An activated synapse will turn on when the global impulse signal, STIMULATE*, is asserted. The global signal CLEAR* is asserted after every STIMULATE* assertion to inactivate synapses in preparation for the next round of sampling and activation. The Connection List is a multiple-bit-wide memory that holds the synapse addresses and axonal delay of each efferent connection in its domain. For large systems, we plan to divide the network into domains that will permit a certain level of parallel sampling of neuron and sensor outputs, which should enhance system scalability. The Connection Lists across all domains hold the pattern of connectivity for the system, and thus their contents determine system behavior. A connection pattern can be fixed in ROM, or, as in our present system, loaded via computer for experimentation. Figure 5 illustrates a simplified single-domain system comprising the Connection List, a sensor, the State Machine, and four neuromorphic chips, each of which contains a number of artificial neurons. The outputs of the artificial neurons on each chip are sampled via a multiplexer, which is
Figure 5: Simplified block diagram for a single-domain system showing basic operation. All sensory and neuronal outputs simultaneously activate the artificial synapses that they connect to through the virtual wires.
selected by the on-chip address decoder. Each neuron is in one of two states, so only a single output pin is needed to read all of them. In operation, the State Machine reads the state of each sensor element and every neuron in its domain. A spiking neuron or sensor output is detected by the State Machine, which then activates all of the synapses that connect to that particular sensor or neuron. After all outputs have been read, STIMULATE* is asserted transiently, briefly turning on the activated synapses. This is followed by asserting CLEAR*, which inactivates all synapses. As with Mahowald's method of connecting neuron outputs to synapses (Mahowald 1992), addresses of synapses and neurons are used rather than direct connections carrying spikes.
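The sampling cycle just described (read active outputs, activate connected synapses after a per-connection delay, pulse STIMULATE*, then CLEAR*) can be sketched as a discrete-time loop. The class below is an illustrative software model, not the chip's State Machine; delays are counted in sampling periods and all names are our own assumptions.

```python
from collections import defaultdict

class VirtualWires:
    """Discrete-time sketch of the sample/activate/stimulate/clear
    cycle.  connection_list maps a source id to (synapse, delay)
    pairs; a delay of 0 fires on the current cycle."""
    def __init__(self, connection_list):
        self.connections = connection_list
        self.pending = defaultdict(list)   # target cycle -> synapse addresses
        self.cycle = 0

    def run_cycle(self, active_sources):
        # 1. Sample outputs; schedule each efferent connection's synapse
        #    for activation after its programmed delay.
        for src in active_sources:
            for synapse, delay in self.connections.get(src, []):
                self.pending[self.cycle + delay].append(synapse)
        # 2. STIMULATE*: transiently turn on synapses due this cycle.
        fired = self.pending.pop(self.cycle, [])
        # 3. CLEAR*: the consumed activations are discarded; later-delayed
        #    connections remain scheduled for future cycles.
        self.cycle += 1
        return fired
```

For example, a source wired to one synapse with zero delay and another with a two-cycle delay fires them on the first and third sampling cycles, respectively.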
3.3 Standard Dendrite Compartment. Figure 6 illustrates the basic integrated circuit layout of our standard dendrite compartment. Each compartment has a capacitor, Cm, that represents the membrane capacitance, a resistor, Rm, that represents the membrane resistance, and an axial resistor, Ra, that represents the cytoplasmic resistance. For the results
Figure 6: Basic VLSI layout for the standard dendrite compartment. Control lines for the resistors permit adjustment of resistance over a limited range. Vrest establishes the resting voltage (typically 1 V). reported here, the size of the artificial dendrite standard compartment was 18 by 180 µm, with most of this area taken up by the capacitor. The capacitor is the largest element in the standard dendrite compartment and is implemented using conventional silicon processing methods (e.g., Allen and Holberg 1987). The capacitor was fabricated with two layers of polysilicon separated by a thin oxide layer. The top plate of the capacitor is polysilicon layer 2 (poly2) and connects to a ground bus that runs perpendicular to the long axis of the capacitor. The bottom plate is polysilicon layer 1 (poly1), which connects directly to the resistors, Ra and Rm, and to the synapse transistors in the stimulus memory (see Figs. 4 and 6). The capacitance for the current-size standard compartment capacitor is approximately 1 pF, based on an oxide thickness of approximately 700 Å. There are many techniques to reduce the footprint of the capacitor while keeping the same capacitance: a thinner dielectric, a material with a higher dielectric constant (e.g., silicon nitride), or three-dimensional capacitors (e.g., trench or tower capacitors), but we will not explore these further here. The compartmental resistors may be implemented by a number of standard silicon fabrication techniques: well, pinched, active, and SC (e.g., Allen and Sanchez-Sinencio 1984). The resistor footprint for a particular resistance depends not only on the resistance value but also on the implementation technique. Well resistors are made by n- or p-diffusion and have a footprint advantage over the other techniques because the well resistor can be put under the capacitor.
Therefore, a well resistor does not take up any silicon real estate, but it has the disadvantage of relatively small resistance (measured as 5 kΩ per square for our chips). Pinched resistors have a higher resistance but cannot be placed under the capacitor. Active and SC resistors are made from MOS transistor circuits that are designed to emulate resistor behavior over a certain range of terminal voltages and resistance values.
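The quoted ~1 pF value for the poly1-poly2 capacitor can be sanity-checked with the parallel-plate formula C = k·ε₀·A/t_ox. In the sketch below the oxide thickness comes from the text, but the usable plate fraction of the 18 by 180 μm footprint is an assumption:

```python
# Back-of-the-envelope check of the ~1 pF poly-poly capacitor using the
# parallel-plate formula C = k * eps0 * A / t_ox.  The ~700 Angstrom oxide
# thickness is from the text; the usable plate area (a fraction of the
# 18 x 180 um compartment footprint) is an assumption.
EPS0 = 8.854e-12                  # vacuum permittivity, F/m
K_SIO2 = 3.9                      # relative permittivity of SiO2
t_ox = 700e-10                    # 700 Angstrom, in meters
area = 0.6 * (18e-6 * 180e-6)     # assumed plate area, m^2

C = K_SIO2 * EPS0 * area / t_ox
print(f"C ~ {C * 1e12:.2f} pF")   # -> C ~ 0.96 pF
```

With these assumed numbers the estimate lands within a few percent of the stated 1 pF, consistent with the capacitor dominating the compartment footprint.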
We have implemented n-well, active, and SC resistors on different chips and will report on the details of their design and relative behavior later (Elias and Meshreki 1993). In our standard dendrite compartment, n-well resistors go under the capacitor, and active or SC resistors are placed at the ends of the capacitor, as shown in Figure 6. Independent control signals for changing the resistance of the SC or active Rm and Ra pass along both sides of the compartment. For chips with active resistors, the control signals are DC voltages that permit a certain range of adjustment. With SC resistors, the control signals are AC voltages whose frequency determines the resistance. Presently, the Rm resistors in all of the compartments share the same control signals, and the Ra resistors share a different set of control signals. Therefore, all compartments have nominally the same Rm and Ra resistances.

The standard dendrite compartment was designed to abut with adjacent compartments and was pitch-matched to the on-chip virtual wire circuitry. This method makes the construction of artificial dendritic trees a relatively simple task: to make a branch, standard compartments are placed side by side until the desired branch length is reached. Branches are then connected via metal or poly wires to form trees. The spacing between compartments is the minimum distance between capacitors (2 μm). The compartments are aligned such that the inputs of one compartment connect to the outputs of the previous compartment.
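The chain of abutting Cm/Rm/Ra compartments described above is, electrically, a discretized passive cable. A minimal numerical sketch of that behavior (forward-Euler integration; all parameter values are illustrative, not the chip's actual ones):

```python
# Minimal passive-cable sketch of a branch built from abutting RC
# compartments: each has membrane capacitance Cm and leak Rm to the
# resting voltage, with axial resistance Ra between neighbors.
# Forward-Euler integration; parameter values are illustrative.
N = 15          # compartments in the branch
Cm = 1e-12      # membrane capacitance, farads (~1 pF, as in the text)
Rm = 1e9        # membrane (leak) resistance, ohms (assumed)
Ra = 1e6        # axial resistance between compartments, ohms (assumed)
V_REST = 1.0    # resting voltage (Vrest ~ 1 V in the text)
dt = 1e-7       # time step, seconds

v = [V_REST] * N

def step(i_syn):
    """Advance one Euler step; i_syn[k] is synaptic current into compartment k."""
    dv = []
    for k in range(N):
        i = (V_REST - v[k]) / Rm + i_syn[k]       # leak + synaptic drive
        if k > 0:
            i += (v[k - 1] - v[k]) / Ra           # axial current from left
        if k < N - 1:
            i += (v[k + 1] - v[k]) / Ra           # axial current from right
        dv.append(dt * i / Cm)
    for k in range(N):
        v[k] += dv[k]

# Inject a 20-us, 1-nA pulse into the distal tip; the transient spreads
# toward the proximal end and decays back toward rest.
peak = V_REST
for t in range(2000):                             # simulate 200 us
    inj = [0.0] * N
    inj[-1] = 1e-9 if t < 200 else 0.0
    step(inj)
    peak = max(peak, max(v))
```

The same update rule extends to branched trees by summing axial currents over each compartment's neighbors, which is how side-by-side compartments wired at a branch point behave.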
3.4 A Simple Silicon Implementation. The artificial dendritic tree circuit and its on-chip virtual wiring were fabricated using a 2-μm CMOS double-poly n-well process on a 2 by 2 mm MOSIS Tiny Chip (e.g., Mead 1989). The artificial somata and output multiplexer were left off the chips to permit experimentation with different soma circuits. Four artificial dendritic branches, each having 15 excitatory and 15 inhibitory artificial synapses, were implemented on the chip. The number of synapses was kept low in order to leave open silicon areas on the chip for other analog test circuits. Figure 7 shows the complete artificial dendrite chip layout. In the current implementation, the four branches are in-line, with a gap between each branch, and centered on the die. The ends of each branch are taken out of the chip through package pins to allow experimentation with different tree structures. Multiple chips can be combined as well to produce tree structures with more branches, longer branches, or higher-order branching. The remaining circuitry makes up the virtual wires. We are currently working on several new chip designs that we expect will reduce the silicon area needed for the on-chip virtual wires. We are also investigating the addition of shunting inhibition as well as local synaptic weight storage that could be used to modify the effective weights of existing connections.
Figure 7: Chip layout of artificial dendrites fabricated using a MOSIS 2-μm double-polysilicon standard CMOS process. The four artificial dendritic branches can be seen in the center of the die. The ends of each branch are connected to pads, which allows experimentation with different branching structures. Each branch has 30 synapses (15 excitatory and 15 inhibitory), which are uniformly spaced along the branch.
4 Simple Test Circuits
Three simple artificial dendritic tree experiments, two of which do not make use of the temporal aspects of dendritic trees, are described next. In each experiment, the output shown represents the measured branch-node voltage from one of our circuits for a period of time after the tree received impulsive afferent sensory signals. The sensory signals for each experiment were generated by a computer and were applied to the artificial synapses through a parallel interface. The sensor elements were set to a logic one if the sensor field was above a fixed threshold voltage and a logic zero otherwise. In each experiment, a sequence of sensor data over time was presented to the tree and the resultant waveform was captured with an 8-bit digitizer. Although binary-level, one-dimensional sensor data were used, each test circuit would produce similar results with multilevel, two-dimensional sensor data, albeit with different connection patterns and branching structure. Sensor elements that are a logic one at time t cause an impulse signal to be applied to the gate terminals of the artificial synapses that they connect to through the virtual wires. Therefore, active sensor elements (i.e., those that hold a logic one at time t) cause their respective artificial synapses to turn on transiently. Inactive sensor elements (i.e., those that hold a logic zero at time t) do not cause any artificial synapses to turn on.

Figure 8 illustrates how a two-branched artificial dendritic tree can detect asymmetric patterns in the sensor field.

Figure 8: Input sensor with six elements connected to a dual-branch artificial dendritic tree. The connection pattern shown classifies input patterns into symmetric and asymmetric classes. The sensor, with its data field, is shown at four different times, the last having a symmetric data pattern. Symmetric data fields result in a null output. Asymmetric sensor data fields produce either a positive or negative voltage trajectory. The output for the four sample times is shown at the right, which represents measured data from one of our circuits. Excitatory artificial synapses are the top connections on each branch. Inhibitory artificial synapses are the bottom connections.

In this experiment, the sensor had 32 elements, but only six elements are shown here for simplicity. The same results would be obtained with virtually any size of linear sensor array. The center of the sensor array represents the plane of symmetry, and the connections for a particular branch go only to sensor elements on one side of this plane. In this example, the top branch has one inhibitory and two excitatory connections; the bottom branch has one excitatory and two inhibitory connections. This connection pattern is not the only one that produces acceptable output; in general, there may exist a large number of connection patterns that produce acceptably good results. The sensor contents are shown at four different times, one of which contains a symmetric pattern. Not shown in the figure are the two trivial cases in which all elements are either logic one or logic zero. Both of these are symmetric and produce no output transient voltage. The case in which the sensor elements are all ones activates all artificial synapses that are connected to these sensor elements, and the resultant signals sum approximately to zero at the branch point. Each connection from one side of the sensor plane of symmetry is mirrored by an opposite-polarity connection on the other side that is equidistant from the branch point. When there is an asymmetric sensory pattern, as for sample numbers 0-2, there is an imbalance between the activated artificial synapses, which results in a transient voltage at the branch point.
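The mirrored-polarity logic can be captured with a toy linear-summation model. The ±1 weights below stand in for synaptic polarity and are illustrative, not measured efficacies:

```python
# Sketch of the Figure 8 symmetry classifier: each sensor element on one
# side of the midline is mirrored by an opposite-polarity synapse at the
# same electrical distance on the other side.  In a linear summation
# model, any mirror-symmetric pattern then sums to zero at the branch
# point.  The +/-1 weights are illustrative stand-ins for polarity.
weights = [+1, +1, -1, +1, -1, -1]    # antisymmetric about the midline

def branch_response(pattern, w=weights):
    """Instantaneous summed drive at the branch point (linear sketch)."""
    return sum(wi * xi for wi, xi in zip(w, pattern))

print(branch_response([1, 0, 1, 1, 0, 1]))   # symmetric field  -> 0
print(branch_response([1, 1, 1, 1, 1, 1]))   # all ones         -> 0
print(branch_response([1, 1, 0, 0, 0, 0]))   # asymmetric field -> 2
```

Note that the all-ones and all-zeros cases cancel exactly here, matching the two trivial symmetric cases described in the text; the real circuit cancels only approximately because of device mismatch.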
Figure 9: Experiments showing the target-direction capabilities of dendrites (after Rall 1964). (a) Target (shaded sensor element) moves from right to left across the sensor array. The resultant waveform is lower in peak voltage than in (b), where the target moves from left to right. A simple thresholding device classifies target direction.
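The direction selectivity in Figure 9 can be sketched with a toy model of Rall's mechanism: each synapse's postsynaptic potential reaches the branch point after an axial delay proportional to its distance, so a distal-to-proximal sweep makes the arrivals coincide while a proximal-to-distal sweep spreads them out. The PSP kernel and delay constants below are assumptions:

```python
import math

# Toy model of Rall-type direction selectivity.  Each synapse's PSP takes
# an axial delay proportional to its distance from the branch point; when
# the stimulus sweep runs distal-to-proximal, stimulus delay and axial
# delay cancel and the PSPs arrive together.  Kernel and delays assumed.

def epsp(t, tau=1.0):
    """Alpha-function PSP (peak 1 at t = tau); zero before onset."""
    return (t / tau) * math.exp(1.0 - t / tau) if t > 0 else 0.0

def peak_at_branch(order, dt_stim=1.0, delay_per_comp=1.0):
    """Peak summed branch-point voltage when synapse positions in `order`
    (0 = most proximal) are stimulated one per stimulus time step."""
    arrivals = [i * dt_stim + pos * delay_per_comp
                for i, pos in enumerate(order)]
    return max(sum(epsp(0.1 * k - t0) for t0 in arrivals)
               for k in range(400))

toward = peak_at_branch(list(range(8)))          # proximal first: spread out
away   = peak_at_branch(list(range(7, -1, -1)))  # distal first: coincident
```

In this sketch the distal-first sweep yields a much larger peak than the proximal-first sweep, which is the asymmetry the comparator thresholds in the hardware experiment.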
Figure 9 shows an experiment that follows an analysis of temporal processing in dendrites by Rall (1964). In this example, adjacent excitatory artificial synapses on one branch are connected to adjacent sensor elements. In this experiment, the sensor had 15 elements, but only eight elements are shown for simplicity. This simple connection pattern produces an output that is sensitive to the direction and speed of a moving target. Figure 9a shows the resultant branch-point voltage for an eight-segment time series in which a target moves across the sensor array from right to left. The target, in this case, is a logic one. The branch-point voltage transients occur as the target moves across the sensor field, as can be seen in the plot of voltage vs. time. In Figure 9a, the artificial synapse nearest the branch point is stimulated first, followed later by stimulation of the next more distal synapse. The effect is that the resultant voltage transients arrive at the branch point well separated in time and therefore do not overlap significantly. In Figure 9b, the target moves from left to right. In this case, the most distal artificial synapse is stimulated first, followed later by stimulation of the more proximal synapses. The resultant voltage at the branch point is larger than it was when the target moved in the opposite direction because the arrival times of the individual transients are more closely aligned. The classification of target direction is then completed by a simple comparator.

Figure 10 shows a simplified circuit diagram that uses a single branch of artificial dendrite to provide a control signal for a maneuvering-target tracking application (Elias 1992c). The sensor, in this example, is a threshold device with seventeen elements, each of which outputs a one or a zero.

Figure 10: Input sensor with 17 elements connected to a branch of an artificial dendritic tree that responds to the position of a target in the sensor array. The connection pattern shown produces a dynamic response that depends nonlinearly on how far off the target is from the center. When the target is centered, the output (at S) is a null. As the target moves off center, the resultant voltage increases rapidly with the separation distance between target and center. The sensor, with its data field, is shown at nine different times. Each sample time shows the target (in this case a 0) going off center and the resultant output (offset for clarity) from the artificial dendritic branch. The top half of the sensor array connects only to excitatory artificial synapses. The bottom half connects only to inhibitory synapses. The resultant voltage transients at S were captured using an 8-bit digitizer.

In Figure 10, the one-dimensional sensor pattern over time is that of a simple maneuvering target, which, in this example, is a logic zero on a background of logic ones. In general, the target could be of any shape as long as it was distinguishable from the background. When the target is on center, the output of the dendritic branch is approximately zero. This is because all excitatory and inhibitory artificial synapses are simultaneously conducting, thereby cancelling each other. Small variations in target position around the center produce relatively small output voltage transitions, which can be used for low-gain system control. If the target moves below center, as shown in Figure 10, the resultant voltage transients are positive. If, however, the target moves above center, the transients are negative. As the target moves farther off center, either above or below, the resultant branch output peak voltage rapidly increases. This occurs because more proximal artificial synapses turn on, which, in effect, shifts the system control to higher gain. The relative amplitudes of the branch output voltage transients as a function of the distance between target and sensor center can be arbitrarily set by moving the connections of particular sensor elements to either more distal or more proximal artificial synapses (Elias 1992b).
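The position-dependent gain of the tracking branch can be sketched as follows. The mapping of far-off-center sensor elements to proximal (less attenuated) synapses, the exponential attenuation model, and its constant are all assumptions made for illustration:

```python
import math

# Sketch of the Figure 10 tracking branch: 17 sensor elements, top half
# driving excitatory synapses and bottom half inhibitory ones.  Elements
# farther from the sensor center are assumed to connect to more proximal
# synapses, whose signals attenuate less on the way to the branch point,
# giving the nonlinear gain described in the text.  The exponential
# attenuation model and its constant are assumptions.
N = 17
CENTER = N // 2
LAM = 0.35                          # per-compartment attenuation (assumed)

def synapse_gain(element):
    offset = abs(element - CENTER)
    distal_pos = 8 - offset         # far off center -> proximal synapse
    return math.exp(-LAM * distal_pos)

def branch_output(target):
    """Summed drive for a dark target (logic 0) on a bright background."""
    out = 0.0
    for e in range(N):
        if e == target or e == CENTER:
            continue                # target element silent; center neutral
        sign = 1.0 if e < CENTER else -1.0
        out += sign * synapse_gain(e)
    return out
```

With the target centered the excitatory and inhibitory sums cancel; moving it off center silences one synapse and leaves an uncancelled term whose magnitude grows rapidly with offset, reproducing the low-gain-near-center, high-gain-off-center behavior.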
5 Summary and Discussion
In our research program, we have adopted Mead’s methodology (Mead 1989) for implementing neuromorphic systems: (1) study the relevant biological implementation; (2) extract the important computational principles; (3) make optimum use of the inherent properties of electronic devices; and (4) implement, using standard silicon processing techniques. In the work reported here, we have studied the properties of both active and passive biological dendritic trees as well as the dynamic and static behavior of chemical synapses. We have extracted principles of computation exhibited by passive dendrites with chemical synapses and have translated these principles to a simple and scalable electronic form implemented in standard CMOS technology. Although our electronic models of chemical synapse and passive dendritic tree are, in many respects, extreme simplifications of biological structures, their dynamic electrical behavior appears to satisfactorily follow that of their biological paragons. The artificial dendritic tree structure is based on a current understanding of passive dendritic trees, which results in an extremely simple circuit implementation that is highly scalable. Artificial neurons with extensive dendritic trees have the capability to process signals that have both temporal and spatial significance. In our networks, weights are replaced with connections that, when combined with the sublinear behavior of electrically close synapses and the nearly linear behavior of widely separated synapses, provide a rich computational substrate for signal processing.
Acknowledgments The author wishes to thank Peter Warter for several useful suggestions on chip architecture, Hsu Hua Chu, Samer Meshreki, and Sheela Sastry for assisting with chip layout, design, and testing, and the reviewers for many useful comments.
References

Allen, P. E., and Holberg, D. R. 1987. CMOS Analog Circuit Design. Holt, Rinehart & Winston, New York.
Allen, P. E., and Sanchez-Sinencio, E. 1984. Switched Capacitor Circuits. Van Nostrand Reinhold, New York.
Angstadt, J. D., and Calabrese, R. L. 1991. Calcium currents and graded synaptic transmission between heart interneurons of the leech. J. Neurosci. 11(3), 746-759.
Elias, J. G. 1992a. Spatiotemporal properties of artificial dendritic trees. Proc. Int. Joint Conf. Neural Networks, Baltimore 2, 19-26.
Elias, J. G. 1992b. Genetic generation of connection patterns for a dynamic artificial neural network. In Proceedings of COGANN-92, a workshop on combinations of genetic algorithms and neural networks. IEEE Computer Society Press.
Elias, J. G. 1992c. Target tracking using impulsive analog circuits. In Applications of Artificial Neural Networks III, S. K. Rogers, ed. Proc. SPIE 1709, 338-350.
Elias, J. G., Chu, H. H., and Meshreki, S. 1992. Silicon implementation of an artificial dendritic tree. Proc. Int. Joint Conf. Neural Networks, Baltimore 1, 154-159.
Elias, J. G., and Meshreki, S. 1993. Wide-range variable dynamics using switched-capacitor neuromorphs. In preparation.
Hebb, D. O. 1949. The Organization of Behavior. Wiley, New York.
Hounsgaard, J., and Midtgaard, J. 1988. Intrinsic determinants of firing pattern in Purkinje cells of the turtle cerebellum in vitro. J. Physiol. 402, 731-749.
Koch, C., and Poggio, T. 1987. Biophysics of computation: Neurons, synapses and membranes. In Synaptic Function, G. M. Edelman, W. E. Gall, and W. M. Cowan, eds., Chap. 23. Wiley, New York.
Koch, C., Poggio, T., and Torre, V. 1983. Nonlinear interactions in a dendritic tree: Localization, timing and role in information processing. Proc. Natl. Acad. Sci. U.S.A. 80, 2799-2802.
Llinas, R., and Sugimori, M. 1980. Electrophysiological properties of in vitro Purkinje cell dendrites in mammalian cerebellar slices. J. Physiol. 305, 197-213.
Mahowald, M. A. 1992. Evolving analog VLSI neurons. In Single Neuron Computation, T. McKenna, J. Davis, and S. F. Zornetzer, eds., Chap. 15. Academic Press, New York.
Mahowald, M. A., and Douglas, R. 1991. A silicon neuron. Nature (London) 354, 515-518.
Mead, C. 1989. Analog VLSI and Neural Systems. Addison-Wesley, Reading, MA.
Narendra, K. S., and Parthasarathy, K. 1990. Identification and control of dynamical systems using neural networks. IEEE Transact. Neural Networks 1, 4-27.
Rall, W. 1964. Theoretical significance of dendritic trees for neuronal input-output relations. In Neural Theory and Modeling, R. F. Reiss, ed., p. 73. Stanford University Press, Stanford, CA.
Rall, W. 1989. Cable theory for dendritic neurons. In Methods in Neuronal Modeling: From Synapses to Networks, C. Koch and I. Segev, eds., Chap. 2. MIT Press, Cambridge, MA.
Shepherd, G. M., Brayton, R. K., Miller, J. F., Segev, I., Rinzel, J., and Rall, W. 1985. Signal enhancement in distal cortical dendrites by means of interactions between active dendritic spines. Proc. Natl. Acad. Sci. U.S.A. 82, 2192-2195.
Shepherd, G. M., and Koch, C. 1990. Dendritic electrotonus and synaptic integration. In The Synaptic Organization of the Brain, G. M. Shepherd, ed., appendix. Oxford University Press, New York.
Shepherd, G. M., Woolf, T. B., and Carnevale, N. T. 1989. Comparisons between active properties of distal dendritic branches and spines: Implications for neuronal computations. J. Cognit. Neurosci. 1, 273-286.

Received 8 May 1992; accepted 11 November 1992.
Communicated by A. B. Bonds
Patterns of Local Connectivity in the Neocortex

Andrew Nicoll
Department of Physiology, School of Medical Sciences, University Walk, Bristol BS8 1TD, UK
Colin Blakemore
University Laboratory of Physiology, Parks Road, Oxford OX1 3PT, UK
Dual intracellular recording of nearby pairs of pyramidal cells in slices of rat visual cortex has shown that there are significant differences in functional connectivity between the superficial and deep layers (Mason et al. 1991; Nicoll and Blakemore 1993). For pairs of cells no farther than 300 μm apart, synaptic connections between layer 2/3 pyramidal neurons were individually weaker (median peak amplitude, A, of single-fiber excitatory postsynaptic potentials, EPSPs, = 0.4 mV) but more frequent (connection probability, p = 0.087) than those between layer 5 pyramidal neurons (mean A = 0.8 mV, p = 0.015). Taken in combination with plausible estimates of the density of pyramidal cells, the total numbers of synapses on them, and the number of synapses formed by their intracortical axons, the present analysis of the above data suggests that roughly 70% of the excitatory synapses on any layer 2/3 pyramid, but fewer than 1% of those on a layer 5 pyramidal neuron, are derived from neighboring pyramidal neurons in its near vicinity. Even assuming very extreme values for some parameters, chosen to erode this difference, the calculated proportion of "local synapses" for layer 5 pyramids was always markedly lower than for layer 2/3 pyramidal neurons. These results imply that local excitatory connections are much more likely to provide significant "intracortical amplification" of afferent signals in layer 2/3 than in layer 5 of rat visual cortex.

1 Introduction
Pyramidal neurons are the major excitatory cells of the neocortex and almost exclusively constitute the output of the cortex (Peters 1987a). Their axons make long-range projections, either to other cortical regions or to subcortical structures, but they also have collaterals forming extensive local arborizations within the same cortical area (e.g., Gilbert et al. 1990; Ojima et al. 1991; Kisvárday and Eysel 1992). A typical pyramidal neuron may possess on the order of 10,000 morphologically identified excitatory synapses (Larkman 1991) and yet, even in visual cortex layer 4, the major target of thalamic afferents, only 20-30% of all synapses involve thalamocortical axons (Peters 1987b).

Neural Computation 5, 665-680 (1993) © 1993 Massachusetts Institute of Technology

Since nonpyramidal excitatory neurons (spiny stellate cells) are rare, if not absent, in rat primary visual cortex (Peters 1987a,b), it seems likely that the majority of synapses on any particular pyramidal neuron derive from other pyramidal neurons. Morphological studies have indeed demonstrated that the commonest targets of pyramidal cell axons are other pyramidal cells (Kisvárday et al. 1986; Gabbott et al. 1987; Elhanany and White 1990). In addition, cross-correlation techniques and current source density analysis have provided evidence for the existence of "horizontal," intrinsic connections in neocortex (e.g., Engel et al. 1990; Langdon and Sur 1990, respectively). However, if we are to model the computations performed by cortical circuitry, it is essential to have quantitative information about the origin and effectiveness of the synaptic inputs to individual cells.

In previous studies, two methods have been employed to estimate the number of synapses needed to elicit an action potential in a postsynaptic cell (Andersen et al. 1990). One approach is to divide the difference between resting membrane potential and spike threshold (of a certain cell class) by the estimated amplitude of a single, "quantal" excitatory postsynaptic potential (EPSP) (Martin 1984; Sayer et al. 1989; Andersen et al. 1990). However, effective spike threshold is not an invariant parameter, because resting membrane potential depends on the prevailing level of tonic afferent activity, on the particular preparation used, and on the method of recording. A second approach is to estimate how many presynaptic axons in a surgically isolated strand of cortex must be activated to discharge pyramidal cells in the target area when the strand is stimulated (Andersen et al. 1980, 1990; Sayer et al. 1989).
This method suffers from the disadvantage that it is not possible to know the number of functional fibers in such a strand of tissue. The most direct way to look at synaptic convergence is to make simultaneous recordings between pairs of identified cortical cells. To improve on the above calculations, in this analysis, data from dual intracellular recording from pyramidal neurons in rat visual cortical slices have been combined with experimentally derived values from other aspects of synaptic distribution to estimate parameters of functional connectivity. We were interested to know, among other things, the proportion of a cell’s connections that is “local” as opposed to “long-range,” that is, what fraction of the total input to a pyramid might be provided by immediately neighboring pyramidal cells, information of special relevance to the visual cortex. 2 Determination of Connection Strength, Probability, and Distance
The strength and probability of occurrence of single-fiber synaptic connections between rat visual cortical pyramidal neurons were obtained from dual intracellular recordings in vitro (Mason et al. 1991; Nicoll and Blakemore 1993). Briefly, a pyramidal neuron was impaled intracellularly, and then a second microelectrode, placed within an annular region (usually within a radius of 500 μm) centered on the first, was advanced until a second cell was obtained, usually within the same cortical layer. Single action potentials were elicited in the second cell by injection of depolarizing current pulses, and spike-triggered averaging was used to reveal any EPSP in the first cell. The first neuron was then stimulated to see if there was a connection in the other direction. All cells recorded had electrophysiological characteristics typical of pyramidal cells (McCormick et al. 1985), and their morphological identity was confirmed in many cases by intracellular staining (see Mason et al. 1991). In layer 2/3, out of a total of 549 cell pairs tested, 48 were synaptically connected, equivalent to a connection probability, p, of 0.087. The median peak amplitude, A, of layer 2/3 single-fiber EPSPs was 0.4 mV (Mason et al. 1991). Within layer 5, however, only four connections (mean A = 0.8 mV) were found out of a possible total of 270, equivalent to p = 0.015 (Nicoll and Blakemore 1993). All EPSPs were monosynaptic (see Mason et al. 1991). Spike-triggered averaging of 50 to 100 sweeps at high gain was initially used to determine whether a cell pair was connected, and several hundred to several thousand sweeps were usually averaged when an EPSP was found. However, it should be noted that EPSPs smaller than about 0.03 mV would not have been detected with our methods. This lower connection probability and greater size of deep-layer single-fiber EPSPs have also been observed by Thomson et al. (1992, 1993). The separation of the microelectrode tips, d (and, therefore, of the two cell bodies), was calculated by trigonometry using the angles of microelectrode penetration and the depths of the cells within the slices.
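The trigonometric reconstruction of tip separation can be sketched numerically. The entry points, penetration angles, and depths below are invented for illustration; the paper gives no numerical example:

```python
import math

# Hypothetical numerical sketch of the tip-separation trigonometry: each
# electrode enters the slice surface at a known position and angle, and
# the recorded cell's depth fixes the tip location along the track.  All
# numbers here are invented for illustration.
def tip_position(x_entry_um, angle_deg, depth_um):
    """(x, z) of the electrode tip; depth measured vertically from surface."""
    x = x_entry_um + depth_um / math.tan(math.radians(angle_deg))
    return x, depth_um

x1, z1 = tip_position(0.0, 60.0, 200.0)      # electrode 1 (assumed values)
x2, z2 = tip_position(250.0, 75.0, 240.0)    # electrode 2 (assumed values)
d = math.hypot(x2 - x1, z2 - z1)             # tip separation, um (~200 um)
```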
For the 48 connections between layer 2/3 cells, d ranged from 50 to 350 μm (mean of ~150 μm). All connections except one (where d = 350 μm) were recorded between cells less than 300 μm apart. The distance between the cell bodies of the four connected pairs within layer 5 ranged from 70 to 150 μm (mean of ~90 μm). The variation of p with d was not rigorously investigated, because that would have required that d values be known for all the unconnected pairs as well as those in which connections were found, data that were not recorded in the original experiments. There was a clear impression, however, that the chance of encountering a coupled pair increased as the microelectrodes were brought together. For layer 2/3 dual impalements, only one connection was found for d = 300-500 μm. At least for the smaller sample of layer 5 dual impalements, no connections were found for d = 150-300 μm. The substantial difference in connection probabilities between the two layers therefore seems unlikely to be due to the difference in the distance range sampled. No serious attempt was made to look at connections over the millimeter range, although long-range connections (not necessarily monosynaptic) have been demonstrated in the cat (Ts'o et al. 1986). For the larger sample of layer 2/3 cell pairs, no correlation was found between EPSP onset latency and microelectrode separation (Mason, Nicoll, and Stratford, unpublished results).

3 Analysis of Connectivity Patterns
The results of the analysis, incorporating experimentally derived data from dual intracellular impalement and based on the most plausible assumptions, are summarized in Table 1.

3.1 Layer 2/3 Local Connections. In layer 2/3, monosynaptic connections were found between pairs of pyramids no farther than 300 μm apart. Let us imagine one individual pyramid at the center of a "local sphere" that could be connected in all directions to a number of other cells no farther than that distance away (Fig. 1). How many other pyramidal cells are there in a local sphere of radius 300 μm to which one individual cell at its center could be connected? Assuming a uniform and constant density of cells within the local sphere, and knowing the density of pyramidal neurons in layer 2/3 of rat visual cortex to be 65,000 cells mm^-3 (calculated from Gabbott and Stewart 1987; Peters 1985, 1987a), the number of pyramidal cells contained in a local sphere is 7,345 (Table 1). The center-sphere cell will be connected to only a proportion of those cells, though. Therefore, assuming that the layer 2/3 connection probability of 0.087 is uniform within the local sphere, one pyramidal cell at the center of the local sphere could be connected to around 639 other pyramids in this near vicinity.

Let us take the case of the pyramidal neuron at the center of the local sphere being synaptically connected to another pyramid within the sphere. How many anatomical synapses (SA) mediate one pyramid-pyramid connection? To our knowledge, there is no empirical value for SA in layer 2/3 of the rat visual cortex, but evidence from other species suggests the figure may be small (Kisvárday et al. 1986; Gabbott et al. 1987). Let us assume, to an order of magnitude, that there are only 10 anatomical synapses in a typical inter-pyramid connection (Martin 1984), an estimate that receives a measure of support from electrophysiological studies (see below). If 639 "local" pyramidal neurons provide input to one neuron at the center of our sphere of connection, then 6,390 anatomical synapses on that cell would come from those local pyramids (Table 1). However, a typical layer 2/3 pyramidal neuron possesses a total of ~9,371 excitatory synapses (Larkman 1991), as estimated from counting dendritic spines. If we assume that the most distal 5% of these synapses (in layer 1) are unlikely to be contacted in "local" interactions, then there is a possible maximum of 8,902 excitatory synapses available on an individual pyramidal cell. Therefore, 6,390/8,902 synapses
Table 1: Patterns of Connectivity in Superficial and Deep Neocortical Layers, Considered in Terms of "Local Spheres" of Pyramidal Cells^a

                                                          Layer 2/3     Layer 5
1. Total number of neurons^1 (mm^-3)                      80 x 10^3     45 x 10^3
   Number of pyramids^1 (mm^-3)                           65 x 10^3     36 x 10^3
   "Connection distance,"^2 d (μm)                        300           150
   Connection probability,^2 p                            0.087         0.015
   Volume of tissue contained in a "local sphere"
     of radius d (μm^3) [(4/3) pi d^3]                    1.13 x 10^8   0.14 x 10^8
   Number of pyramids, N, within local sphere             7,345         504
   Number of pyramids, n, within local sphere
     connected to one individual pyramid [p x N]          639           8
2. Mean number of "anatomical synapses," SA,
     between two connected pyramids^3                     10            10
   Total number of synapses, SL, mediating local
     pyramidal connections within the sphere [n x SA]     6,390         80
   Number of excitatory synapses, ST, on one pyramid^4    8,902         12,483
   Proportion of synapses on one pyramid mediating
     local connections [SL/ST]                            71.8%         0.6%
3. EPSP peak amplitude,^2 A (mV)                          0.4           0.8
   Quantal amplitude,^5 a (mV)                            0.1           0.1
   Number of quanta per trial, q [A/a]                    4             8
   Presynaptic release probability,^5 RP                  0.5           0.5
   Number of "physiological synapses," SP, in one
     connection [q/RP; 1 release site = 1 synapse]        8             16
4. Total number of afferent fibers received by an
     individual pyramid [ST/SP]                           1,113         780
                        [ST/SA]                           890           1,248

^a See Figure 1. The parameters were determined on the basis of the most plausible estimates available. Calculated on this basis, although the total number of connections on an individual pyramidal neuron may be similar in both layers (step 4), layer 2/3 pyramids have many more local connections than pyramids in layer 5 (step 2). Values of d and p were obtained experimentally (Nicoll and Blakemore 1993; Mason et al. 1991). Other figures were obtained from published literature or calculated as shown. Cited literature: ^1 Gabbott and Stewart (1987), Peters (1985), and Peters (1987a); ^2 Mason et al. (1991), Nicoll and Blakemore (1993); ^3 Martin (1984); see also Kisvárday et al. (1986) and Gabbott et al. (1987); ^4 Larkman (1991); ^5 Sayer et al. (1989), Jack et al. (1990), Korn and Faber (1991), and Larkman et al. (1991). See text for details.
670
Andrew Nicoll and Colin Blakemore
Figure 1: The geometry of "local" neocortical connections. Imagine that an individual pyramidal neuron (large triangle) receives synaptic input from the local axon collaterals of some of the other pyramidal cells within a "local sphere" (unhatched region). The radius of the sphere is the separation, d, of connected cell pairs, determined from dual intracellular recording (as in Table 1), viz. 300 µm for layer 2/3 and 150 µm for layer 5. The probability of the center cell being connected to any other in the sphere, determined experimentally, was assumed to be constant within the local sphere.
or 71.8% of all the synapses on a layer 2/3 pyramidal neuron come from other pyramidal neurons in its near vicinity.

The number of synapses involved in a synaptic connection may also be estimated from physiological evidence. The median peak amplitude of a single-fiber or unitary EPSP of a layer 2/3 pyramid was found to be 0.4 mV (Mason et al. 1991). Let us assume a quantal amplitude of, say, 0.1 mV (e.g., Sayer et al. 1989; Larkman et al. 1991) and that the
presynaptic release probability is, on average, 0.5 (Jack et al. 1990), as in hippocampal synapses. If one synaptic release site is equivalent to one physiological synapse (Korn and Faber 1991), then a 0.4-mV EPSP would be mediated by 8 "functional" synapses, a number similar to the 10 anatomical synapses constituting a typical inter-pyramidal connection suggested by Martin (1984).

3.2 Layer 5 Local Connections. Taking into account the lower pyramidal cell density in layer 5 (36,000 mm^-3; Gabbott and Stewart 1987), the lower connection probability for layer 5 connections found experimentally (0.015), and using a "local sphere" of radius 150 µm calculated using the same assumptions as in layer 2/3, the number of pyramidal cells in the near vicinity possibly connected to one individual pyramid (at the center of the local sphere) in layer 5 would be only 8 cells (Table 1). For this calculation, we decided to assume again that a typical inter-pyramid connection is made by 10 anatomical synapses, SA, as for layer 2/3. This would make the total number of local inputs on to the center-sphere pyramidal neuron 80 synapses. The mean synapse-to-neuron ratio of both pyramidal cell classes in layer 5 is 13,870 (calculated from Larkman 1991). Because the pyramidal neurons in layer 5 are geometrically very long, let us assume that 90% of those excitatory synapses can mediate "local" connections (12,483; Larkman 1991). That being the case, the proportion of excitatory synapses on a layer 5 pyramid that derives from "local" sources would be only 0.6%, much lower than for layer 2/3 cells.

We found mean EPSP peak amplitude to be about 0.8 mV for layer 5 connections, twice that in layer 2/3 (Nicoll and Blakemore 1993). If the same values of quantal amplitude and release probability are assumed as for layer 2/3, the number of "physiological synapses," SP, involved in a single-fiber EPSP of that amplitude would be 16, calculated in the same way as for layer 2/3.
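The quantal arithmetic (quanta per trial q = A/a, then SP = q/RP) and the afferent-fiber totals in Table 1 follow mechanically from these estimates. A minimal Python sketch (function and variable names are ours, not the authors'; the 1-release-site-equals-1-synapse equivalence is the assumption from Korn and Faber 1991):

```python
def physiological_synapses(epsp_mv, quantal_mv=0.1, release_prob=0.5):
    """Number of "physiological synapses" SP mediating one connection:
    quanta per trial q = A/a, then SP = q / RP (1 release site = 1 synapse)."""
    quanta = epsp_mv / quantal_mv
    return quanta / release_prob

sp_l23 = physiological_synapses(0.4)   # layer 2/3: 0.4-mV unitary EPSP -> 8
sp_l5 = physiological_synapses(0.8)    # layer 5: 0.8-mV unitary EPSP -> 16

# Afferent fibers per cell: total excitatory synapses / synapses per connection
afferents_l23 = 8_902 / sp_l23         # ~1,113 (Table 1, step 4)
afferents_l5 = 12_483 / sp_l5          # ~780 (Table 1, step 4)
```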
This is double the figure for layer 2/3, but still the same order of magnitude as SA. If a typical pyramidal neuron in layer 2/3 has 8,902 anatomical synapses and each connection it makes is mediated via 8-10 physiological or anatomical synapses, then one might suppose that one layer 2/3 cell receives between 890 and 1,113 afferent fibers (Table 1). A similar calculation for layer 5 reveals that a single pyramid there may receive 780-1,248 afferents. In other words, each cell class receives roughly the same number (~1,000) of afferent inputs, although the amount of local input is much greater in layer 2/3. Although both figures have been used in Table 1, we have assumed a rough equivalence between SA and SP (see Korn and Faber 1991), an assumption also employed below.

4 "Worse Case" Scenario
The analysis illustrated in Table 1 suggested that about 70% of the excitatory synapses on a typical layer 2/3 pyramidal neuron originate from
Table 2: Connectivity Parameters Selected so as to Minimize the Differences in Local Connectivity between Layers 2/3 and 5 ("Worse Case Scenario")^a

                                                                 Layer 2/3    Layer 5
1. "Connection distance," d (µm)                                 300          300
   Connection probability, p                                     0.087        0.015
   Number of pyramids, N, within "local sphere"                  7,345        4,068
   Number of pyramids, n, within "local sphere" connected
     to one individual pyramid [p x N]                           639          61
2. Mean number of "anatomical" or "physiological" synapses,
     SA or SP, between two connected pyramids                    2            16
   Total number of synapses, SL, mediating local pyramidal
     connections within the sphere [n x SA]                      1,278        976
   Number of excitatory synapses, ST, on one pyramid             8,902        12,483
   Proportion of synapses on one pyramid mediating
     local connections [SL/ST]                                   14.4%        7.8%

^a The steps and assumptions from the literature are similar to those in Table 1 but the values of d and SA or SP have been altered (see text). Even with this deliberate biasing of the analysis with very extreme parameters, the proportion of "local synapses" is still greater for layer 2/3 than for layer 5, and the proportion for layer 5 still does not approach that found for layer 2/3 in Table 1.
very nearby cells, while less than 1% of those on a layer 5 pyramid derive from other local pyramids. To test this conclusion, we performed a "worse case scenario" calculation by substituting very extreme values, deliberately chosen to abolish the laminar differences in local connectivity (Table 2). Employing the same connection probabilities, we altered two parameters: the number of "anatomical" or "physiological" synapses in a pyramid-pyramid connection, SA or SP, and the connection distance, d. For the analysis in Table 1, we made SA 10 for both layers 2/3 and 5. However, there is evidence from the superficial layers of cat visual cortex that SA may be even smaller (Kisvárday et al. 1986; Gabbott et al. 1987). In Table 2, we therefore made SA for layer 2/3 equal to 2, thereby substantially reducing the fraction of local synaptic inputs. For layer 5, we wished to make SA or SP as large as possible to upwardly bias the final result. We therefore used the estimate for SP of 16 in layer 5 (Table 1), although this number is somewhat higher than present evidence, at least for SA, suggests is feasible (Martin 1984), assuming SA and SP can be interchanged. In a further attempt to deliberately bias the result, d for layer 5 was increased. We based our estimate of d on empirical evidence from the dual impalement experiments, taking it to be the maximum
distance at which cell pairs were actually found to be coupled, within the approximate 500 µm range explored. For layer 2/3, the value of d derived in this manner, 300 µm, corresponds quite well to the overall distribution of spines on the basal dendritic tree plus the probable local density of boutons on axon collaterals (Larkman et al. 1988; Mason et al. 1991). For layer 5, we found interactions only within the surprisingly short distance of 150 µm. Perhaps that is a reflection of different axonal morphology (e.g., Chagnac-Amitai et al. 1990) or, on the other hand, it is possible that the small number of connected pairs recorded in layer 5 by Nicoll and Blakemore (1993) provided an unrepresentative estimate of d. In the "worse case scenario" of Table 2, we therefore substituted d = 300 µm for layer 5, as for layer 2/3, although this is strictly unjustifiable on the basis of the data to hand (see below). Even with these changes, the proportion of local synapses calculated for a layer 5 cell was still only 7.8%, but the proportion of local inputs onto layer 2/3 pyramids was substantially reduced.

5 Assumptions and Parameters
Unfortunately, this analysis is hampered by a lack of directly applicable empirical data, so we have had to make reasonable assumptions for some of the parameters. The "worse case scenario" calculation shows that although the results can be considerably altered by one's choice of particular values, the overall approach and conclusion seem robust.

5.1 Connection Distance and Probability. Ideally, it would be desirable to derive a quantitative relationship between connection probability and cell separation. Although that information was not available for the present work, it would, in principle, be possible to describe p in terms of d with results from dual impalement experiments, although this would be laborious because d would have to be known for all cell pairs that were unconnected as well as connected. Also, the frequency of monosynaptic connections over larger distances could be low. For layer 2/3, we based the size of the "local sphere" (Fig. 1) on a radius of 300 µm, rather than the mean value for d. This was primarily because the value of p must be an overall estimate that applies to all the connections with cell separations no greater than that distance: the connection probability is unknown for subsamples with greater or smaller d. Similar assumptions were applied to layer 5 in the Table 1 analysis, where no connections were found beyond d = 150 µm. Hence, it was implicit in our calculations that p was constant throughout the local spheres used. It was necessary to assume that the connection distances sampled corresponded to the actual cell separations over which monosynaptic connections are found at the observed probabilities. Local excitatory interactions between pyramidal neurons are primarily mediated by their
basal dendrites and recurrent axon collaterals, which both extend over similar distances as used for the local spheres here. For pyramidal cells in both layers 2/3 and 5 of rat visual cortex, a minority of the dendritic spines are located at a path length of greater than about 150 µm from the cell body (Larkman 1991). Equivalent data are not available for presynaptic bouton distribution. However, bearing in mind the considerable overlap of basal dendrites and recurrent axon collaterals in pyramidal cells, at least in layer 2/3 (Larkman et al. 1988; Mason et al. 1991), let us say, as a very rough approximation, that most presynaptic boutons are also located proximal to that distance as well. Hence, at least for layer 2/3, d = 300 µm tentatively represents the end of a range over which two pyramidal neurons may be connected through their local neurites. However, any dendritic similarities between layer 2/3 and 5 cells do not explain differences in connection probabilities. Perhaps they are a reflection of differing axonal rather than dendritic morphology. Layer 5 pyramids, especially burst-firing ones, may have more horizontal axon collaterals (Chagnac-Amitai et al. 1990), possibly with lower numbers of presynaptic boutons (Ojima et al. 1991; Nicoll and Blakemore 1991). Although it may be difficult to apply results derived from a number of cells stained by gross extracellular dye injection to the present problem, Burkhalter (1989) noted, in rat visual cortex, that axons of layer 5 cells display clustered distributions whereas those of layer 2/3 cells do not, although there seems to be stronger evidence for clustered axonal projections in the superficial layers of the cat's visual cortex (Kisvárday et al. 1986; Kisvárday and Eysel 1992). Still, even when d is artificially increased to 300 µm for layer 5 (Table 2), the proportion of local synapses does not approach that found in layer 2/3, mainly as a consequence of the low p found for layer 5.
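The "worse case" figures of Table 2 follow from the same sphere formula with the deliberately biased parameters substituted. A short sketch under the stated assumptions (the helper function is ours, not the authors' code):

```python
import math

def local_proportion(d_mm, density, p, syn_per_conn, total_syn):
    """Fraction of a cell's excitatory synapses that local pyramids could supply:
    n = p * density * (4/3) * pi * d^3, then proportion = n * SA / ST."""
    n_local = p * density * 4.0 / 3.0 * math.pi * d_mm ** 3
    return n_local * syn_per_conn / total_syn

# Layer 2/3 with SA lowered to 2 synapses per connection
print(local_proportion(0.300, 65_000, 0.087, 2, 8_902))    # ~0.14 (14.4% in Table 2)
# Layer 5 with d raised to 300 um and SP = 16
print(local_proportion(0.300, 36_000, 0.015, 16, 12_483))  # ~0.078 (7.8% in Table 2)
```

Even with these extreme choices, the layer 5 proportion stays well below the layer 2/3 value, which is the robustness point the section argues.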
For further discussion of cortical connectivity see Braitenberg and Schüz (1991).

5.2 Synapses between Pyramidal Cells. The number chosen for SA or SP can have considerable influence on the analysis. We initially assumed, to within an order of magnitude, that there were 10 anatomical synapses (SA) between any two connected pyramids (Martin 1984; step 2, Table 1), which compares well to the estimates of SP. There is no direct empirical value of SA for our very specific situation in the rat, but the figure of 10 is reasonable on the basis of general morphological and electrophysiological considerations (Martin 1984). Gabbott et al. (1987) studied the synaptic contacts between two "neighboring" (i.e., sharing a small patch of overlapping axon collaterals some distance from the somas) pyramidal cells in layer 5 of cat visual cortex under the electron microscope. Although probably an underestimate, they found the number of anatomical synapses between the two cells to be only 4; Kisvárday et al. (1986) suggested that this figure could be even lower for layer 3 pyramids in cat striate cortex. In Table 2, we explicitly assumed a direct correspondence between structural and functional synapses so that we
could use a very high value in one stage of the calculation. It is unknown whether the values of SA and SP are different in layers 2/3 and 5.

5.3 Synapse-to-Neuron Ratio. Some of the most distal synapses of a pyramidal neuron, especially on the long cells of layer 5, may be unavailable for intralaminar, "local" connections, and a small adjustment was made in the calculation to take account of this. We found, however, that within reasonable limits, the value set for the total number of synapses even on a layer 5 pyramid (ST; Table 1) does not overwhelmingly affect the proportion of local synapses (SL) to the total number (SL/ST) calculated. In layer 5, when SA = 10, SL comes out as 80. Even if there were only, say, 8,000 excitatory synapses within layer 5 on one layer 5 cell (Larkman 1991), this still represents a small (1%) SL/ST proportion. For layer 2/3, the number of local pyramids, and hence SL, is much greater than in layer 5. If the number of pyramids in close vicinity to one individual pyramid in layer 2/3 is 639 (Table 1), and the number of anatomical synapses in an inter-pyramid connection, SA, were 2, then SL would be 1,278. If we say there are 5,600 excitatory synapses distributed within layer 2/3 (Larkman 1991), this represents an SL/ST proportion of 22.8% on any one layer 2/3 pyramidal cell, still much greater than for layer 5.
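The sensitivity claim above can be checked directly: even with the total synapse count lowered and SA reduced to 2, the laminar difference survives. A brief sketch (variable names are ours; figures from the text):

```python
# Layer 5: n = 8 local pyramids, SA = 10 -> SL = 80 local synapses
sl_l5 = 8 * 10
for total in (12_483, 8_000):                      # published total vs. a deliberately low total
    print(f"L5 proportion: {sl_l5 / total:.1%}")   # ~0.6% and ~1.0%

# Layer 2/3: n = 639 local pyramids, SA lowered to 2 -> SL = 1,278
sl_l23 = 639 * 2
print(f"L2/3 proportion: {sl_l23 / 5_600:.1%}")    # ~22.8% of within-layer synapses
```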
5.4 Single-Fiber and Single-Quantum EPSPs. We used estimates of quantal amplitude, a, obtained from studies in the hippocampus for our arguments here, as no published data were available for rat neocortex. The EPSP amplitudes used were based on results from intracellular recording, which may differ from estimates obtained using whole-cell patch pipettes, where cell input resistances may be higher (Staley et al. 1992). With our estimation of physiological synapses (SP), we achieved a reasonable agreement between SA and SP and were able to very tentatively conclude that the number of connections received by a cell is similar for layers 2/3 and 5. We applied a value of a = 0.1 mV for connections in both layers 2/3 and 5 (Table 1), although there is no reason one way or the other to suppose they are the same or different. The reported value of a ranges from about 0.1 to 0.4 mV (see Mason et al. 1991), but an estimate at the lower end of that range was employed because that was considered more reasonable for neocortex, based on current indications (Mason, Nicoll, and Stratford, unpublished observations). For the sake of argument, if a is set at 0.4 mV but the same EPSP amplitudes are used as found experimentally with sharp electrodes (0.4 mV, layer 2/3 and 0.8 mV, layer 5), then the number of physiological synapses mediating one single-fiber connection becomes 2 for layer 2/3 (as opposed to 8 for a = 0.1 mV) and 4 for layer 5 (16 for a = 0.1 mV; Table 1). This is in closer agreement with the possible values of anatomical synapses suggested
by Kisvárday et al. (1986) and Gabbott et al. (1987) but, of course, a range of answers is possible, depending on the assumptions made.

6 Discussion
It has been suggested that there is a fundamental neuronal circuit underlying cortical function (e.g., Szentágothai 1978) and that functional differences between cortical areas mainly result from different afferent and efferent connections of the common circuit (Peters and Sethares 1991). The visual cortex contains a precise "map" of the visual field, and individual neurons, especially in the central field representation, generally have receptive fields that cover only a tiny fraction of the entire field, even in the rat, which lacks a very pronounced central specialization in its retina (see Sefton and Dreher 1985). This demands that the major suprathreshold excitatory input to each cortical cell must originate ultimately from the restricted region of the retina corresponding to the classical receptive field. This in turn implies that very local intrinsic connections, deriving from cells with overlapping receptive fields, are much more important than any distant "horizontal" connections in constructing the receptive field itself. Douglas and colleagues have proposed a "canonical microcircuit" for the neocortex to account for the intracellular responses of cat striate cortical neurons to stimulation of thalamic afferents (Douglas et al. 1989; Douglas and Martin 1991). No one cell or class of cells receives sufficient synaptic drive directly from the thalamus to make it fire at the high rates observed following presentation of an optimal visual stimulus. They suggest that excitatory drive originating from the thalamus is augmented by successive stages of intracortical recurrent excitation. They term this process "intracortical amplification" and propose that it is mediated primarily by local excitatory connections between pyramidal cells. The optimal parameters for the canonical model involve pyramidal cells in the superficial and deep layer groups securing half their intracortical excitatory conductance from their own respective populations (Douglas and Martin 1991).
Our results support this model for pyramids in layer 2/3 of the rat, which appear to receive around 70% of their synapses from other local pyramids, but not for cells in layer 5. The latter (at least in the rat) are likely to be dominated by other inputs. If only a minority of the excitatory synapses on layer 5 pyramidal neurons derive from either thalamic axons or local pyramids, where do the rest come from? Presumably their origins include (1) excitatory cells in other layers, especially pyramidal cells in layer 2/3, which have a prominent projection to layer 5 in rats (Burkhalter 1989), (2) long-range projections from other pyramids elsewhere in primary visual cortex, and (3) possibly cortico-cortical backprojections from other visual cortical areas (see Sefton and Dreher 1985). As for the cat (Ts'o et al. 1986; Gilbert et al. 1990),
the input from distant regions of cortex probably provides information from outside the classical receptive field, which is therefore unlikely to act as an amplifier during localized visual stimulation. Pyramidal neurons in layer 2/3 and nonbursting pyramidal layer 5 neurons project to the contralateral hemisphere through the corpus callosum (Hallman et al. 1988; Hübener and Bolz 1988), but burst-firing cells in layer 5 project to a variety of subcortical targets such as the superior colliculus (Schofield et al. 1987; Hübener and Bolz 1988), and/or the pons (Hallman et al. 1988). Chagnac-Amitai and Connors (1989a,b) suggested that layer 5 burst-firing neurons form a subnetwork of strongly, yet sparsely, connected neurons, a notion that the results of dual impalement studies would support. However, the exact visual role that local inter-pyramidal synaptic connections play in layer 5 remains unclear, especially as the amount of depolarization they produce is small relative to neuronal spike threshold. Unless the synapses of thalamocortical axons are exceptionally numerous or remarkably effective in depolarizing the cells, the "enabling depolarization," essential for the activation of layer 5 circuitry, must come from elsewhere. The most likely candidate for this enabling signal is the input from layer 2/3, which could then be amplified by both the thalamic and the local pyramidal inputs.

7 Conclusions
The analysis in this paper applies specifically to the visual cortex of the rat. However, to the extent that the neocortex might have common principles of circuitry, conserved across species, the results might be of more general relevance. As more precise values for Table 1 become available, the parameters of connectivity can easily be revised and extended so that the numbers of synapses and neurons in whole neocortical columns can eventually be derived.

Acknowledgments

This work was supported by the Medical Research Council, The Wellcome Trust, and the Human Frontier Science Program. We would like to thank Dr. A. Larkman, Dr. P. Bush, Dr. T. Sejnowski, Dr. B. Connors, and Mr. A. Strassberg for helpful discussion regarding this work.

References

Andersen, P., Silfvenius, H., Sundberg, S. H., and Sveen, O. 1980. A comparison of distal and proximal dendritic synapses on CA1 pyramids in guinea-pig hippocampal slices in vitro. J. Physiol. (Lond.) 307, 273-299.
Andersen, P., Raastad, M., and Storm, J. F. 1990. Excitatory synaptic integration in hippocampal pyramids and dentate granule cells. Cold Spring Harbor Symp. Quant. Biol. LV, 81-86.
Braitenberg, V., and Schüz, A. 1991. Anatomy of the Cortex: Statistics and Geometry. Springer-Verlag, Berlin.
Burkhalter, A. 1989. Intrinsic connections of rat primary visual cortex: Laminar organization of axonal projections. J. Comp. Neurol. 279, 171-186.
Chagnac-Amitai, Y., and Connors, B. W. 1989a. Horizontal spread of synchronized activity in neocortex and its control by GABA-mediated inhibition. J. Neurophysiol. 61, 747-758.
Chagnac-Amitai, Y., and Connors, B. W. 1989b. Synchronized excitation and inhibition driven by intrinsically bursting neurons in neocortex. J. Neurophysiol. 62, 1149-1162.
Chagnac-Amitai, Y., Luhmann, H. J., and Prince, D. A. 1990. Burst generating and regular spiking layer 5 pyramidal neurons of rat neocortex have different morphological features. J. Comp. Neurol. 296, 598-613.
Douglas, R. J., and Martin, K. A. C. 1991. A functional microcircuit for cat visual cortex. J. Physiol. (Lond.) 440, 735-769.
Douglas, R. J., Martin, K. A. C., and Whitteridge, D. 1989. A canonical microcircuit for neocortex. Neural Comp. 1, 480-488.
Elhanany, E., and White, E. 1990. Intrinsic circuitry: Synapses involving the local axon collaterals of corticocortical projection neurons in the mouse primary somatosensory cortex. J. Comp. Neurol. 291, 43-54.
Engel, A. K., König, P., Gray, C. M., and Singer, W. 1990. Stimulus-dependent neuronal oscillations in cat visual cortex: Inter-columnar interaction as determined by cross-correlation analysis. Eur. J. Neurosci. 2, 588-606.
Gabbott, P. L. A., and Stewart, M. G. 1987. Distribution of neurons and glia in the visual cortex (area 17) of the adult albino rat: A quantitative description. Neuroscience 21, 833-845.
Gabbott, P. L. A., Martin, K. A. C., and Whitteridge, D. 1987.
Connections between pyramidal neurons in layer 5 of cat visual cortex (area 17). J. Comp. Neurol. 239, 364-381.
Gilbert, C. D., Hirsch, J. A., and Wiesel, T. N. 1990. Lateral interactions in visual cortex. Cold Spring Harbor Symp. Quant. Biol. LV, 663-677.
Hallman, L. E., Schofield, B. R., and Lin, C.-S. 1988. Dendritic morphology and axon collaterals of corticotectal, corticopontine, and callosal neurons in layer V of primary visual cortex of the hooded rat. J. Comp. Neurol. 272, 149-160.
Hübener, M., and Bolz, J. 1988. Morphology of identified projection neurons in layer 5 of rat visual cortex. Neurosci. Lett. 94, 76-81.
Jack, J. J. B., Kullmann, D. M., Larkman, A. U., Major, G., and Stratford, K. J. 1990. Quantal analysis of excitatory synaptic mechanisms in the mammalian central nervous system. Cold Spring Harbor Symp. Quant. Biol. LV, 57-67.
Kisvárday, Z. F., and Eysel, U. T. 1992. Cellular organization of reciprocal patchy networks in layer III of cat visual cortex (area 17). Neuroscience 46, 275-286.
Kisvárday, Z. F., Martin, K. A. C., Freund, T. F., Maglóczky, Zs., Whitteridge,
D., and Somogyi, P. 1986. Synaptic targets of HRP-filled layer III pyramidal cells in the cat striate cortex. Exp. Brain Res. 64, 541-552.
Korn, H., and Faber, D. S. 1991. Quantal analysis and synaptic efficacy in the CNS. Trends Neurosci. 14, 439-445.
Langdon, R. B., and Sur, M. 1990. Components of field potentials evoked by white matter stimulation in isolated slices of primary visual cortex: Spatial distributions and synaptic order. J. Neurophysiol. 64, 1484-1501.
Larkman, A. U. 1991. Dendritic morphology of pyramidal neurones of the visual cortex of the rat: III. Spine distributions. J. Comp. Neurol. 306, 332-343.
Larkman, A. U., Mason, A., and Blakemore, C. 1988. The in vitro slice preparation for combined morphological and electrophysiological studies of rat visual cortex. Neurosci. Res. 6, 1-19.
Larkman, A., Stratford, K., and Jack, J. 1991. Quantal analysis of excitatory synaptic action and depression in hippocampal slices. Nature (London) 350, 344-347.
Martin, K. A. C. 1984. Neuronal circuits in cat striate cortex. In Cerebral Cortex, Vol. 2, E. G. Jones and A. Peters, eds., pp. 241-284. Plenum Press, New York.
Mason, A., Nicoll, A., and Stratford, K. 1991. Synaptic transmission between individual pyramidal neurons of the rat visual cortex in vitro. J. Neurosci. 11, 72-84.
McCormick, D. A., Connors, B. W., Lighthall, J. W., and Prince, D. A. 1985. Comparative electrophysiology of pyramidal and sparsely spiny stellate neurons of neocortex. J. Neurophysiol. 54, 782-806.
Nicoll, A., and Blakemore, C. 1991. Differences in inter-bouton distance between intracortical axons of different classes of pyramidal neurone in rat visual cortex. J. Anat. 179, 209-210.
Nicoll, A., and Blakemore, C. 1993. Single-fibre EPSPs in layer 5 of rat visual cortex in vitro. NeuroReport 4, 167-170.
Ojima, H., Honda, C. N., and Jones, E. G. 1991. Patterns of axon collateralization of identified supragranular pyramidal neurons in the cat auditory cortex. Cerebral Cortex 1, 80-94.
Peters, A.
1985. The visual cortex of the rat. In Cerebral Cortex, Vol. 3, A. Peters and E. G. Jones, eds., pp. 19-80. Plenum Press, New York.
Peters, A. 1987a. Number of neurons and synapses in primary visual cortex. In Cerebral Cortex, Vol. 6, E. G. Jones and A. Peters, eds., pp. 267-294. Plenum Press, New York.
Peters, A. 1987b. Synaptic specificity in the cerebral cortex. In Synaptic Function, G. M. Edelman, W. E. Gall, and W. M. Cowan, eds., pp. 373-397. Wiley, New York.
Peters, A., and Sethares, C. 1991. Organization of pyramidal neurons in area 17 of monkey visual cortex. J. Comp. Neurol. 306, 1-23.
Sayer, R. J., Redman, S. J., and Andersen, P. 1989. Amplitude fluctuations in small EPSPs recorded from CA1 pyramidal cells in the guinea pig hippocampal slice. J. Neurosci. 9, 840-850.
Schofield, B. R., Hallman, L. E., and Lin, C.-S. 1987. Morphology of corticotectal cells in the primary visual cortex of hooded rats. J. Comp. Neurol. 261, 85-97.
Sefton, A. J., and Dreher, B. 1985. Visual system. In The Rat Nervous System, Vol. 1, G. Paxinos, ed., pp. 169-221. Academic Press, Sydney.
Staley, K. J., Otis, T. S., and Mody, I. 1992. Membrane properties of dentate gyrus granule cells: Comparison with sharp microelectrode and whole-cell recordings. J. Neurophysiol. 67, 1346-1358.
Szentágothai, J. 1978. The neuron network of the cerebral cortex: A functional interpretation. Proc. R. Soc. London B 201, 219-248.
Thomson, A. M., West, D. C., and Deuchars, J. 1992. Local circuit, single axon excitatory postsynaptic potentials (EPSPs) in deep layer neocortical pyramidal neurones. Soc. Neurosci. Abstr. 18, 1340.
Thomson, A. M., Deuchars, J., and West, D. C. 1993. Paired intracellular recordings reveal large single axon excitatory connections between deep layer pyramidal neurones in rat neocortical slices. J. Physiol. (London) 459, 479.
Ts'o, D., Gilbert, C. D., and Wiesel, T. N. 1986. Relationship between horizontal interactions and functional architecture in cat striate cortex as revealed by cross-correlation analysis. J. Neurosci. 6, 1160-1170.
Received 18 September 1992; accepted 26 January 1993.
Communicated by Paul Adams
Sensitivity of Synaptic Plasticity to the Ca2+ Permeability of NMDA Channels: A Model of Long-Term Potentiation in Hippocampal Neurons

Erik De Schutter
James M. Bower
Division of Biology 216-76, California Institute of Technology, Pasadena, CA 91125 USA
We have examined a model by Holmes and Levy (1990) of the induction of associative long-term potentiation (LTP) by a rise in the free Ca2+ concentration ([Ca2+]) after synaptic activation of dendritic spines. The previously reported amplification of the change in [Ca2+] caused by coactivation of several synapses was found to be quite sensitive to changes in the permeability of the N-methyl-D-aspartate (NMDA) receptor channels to Ca2+. Varying this parameter indicated that maximum amplification is obtained at values that are close to Ca2+ permeabilities reported in the literature. However, amplification failed if permeability was reduced by more than 50%. We also found that the maximum free [Ca2+] reached in an individual spine during synaptic coactivation of several spines depended on the location of that spine on the dendritic tree. Distal spines attained a higher [Ca2+] than proximal ones, with differences of up to 80%. The implications of this result for the uniformity of induction of associative LTP in spines in different regions of the dendrite are discussed.

1 Introduction
Since Hebb (1949) first proposed that a synaptic modification based on the co-occurrence of pre- and postsynaptic activity might underlie learning, this idea has formed the basis for many models of network associative learning (Byrne and Berry 1989). Over the last decade, neurobiologists have been studying a physiological phenomenon known as long-term potentiation (LTP), which can confer on synaptic strengths many of the associative properties that Hebb originally hypothesized (Nicoll et al. 1988). Recent work in the hippocampus has implicated a particular membrane channel, the N-methyl-D-aspartate (NMDA) receptor channel, in a type of LTP that is clearly associative (Landfield and Deadwyler 1988). In this case, an increase in synaptic strength is induced when synaptic stimulation coincides with depolarization of the postsynaptic membrane.
Neural Computation 5, 681-694 (1993) © 1993 Massachusetts Institute of Technology
dependence on postsynaptic depolarization appears to rely on the release of a voltage-dependent block of this channel by Mg2+ ions (Mayer et al. 1984; Nowak et al. 1984; Ascher and Nowak 1988). When this block is released, binding of glutamate to the NMDA channel causes an influx of Ca2+, a rise in free [Ca2+] in the dendritic spine (Regehr and Tank 1990; Müller and Connor 1991), and a change in synaptic efficacy by an as yet not understood secondary mechanism. NMDA channels are permeable to Na+ and K+ as well as to Ca2+ (Mayer and Westbrook 1987; Ascher and Nowak 1988). In most experimental studies on LTP the total ionic current through the NMDA channel is measured. The Ca2+ influx is only a small fraction of this total current and is usually not measured separately, despite its crucial role in the induction of LTP. This distinction may be important because it is known that the Ca2+ permeability of other glutamate receptor channels can vary with the subunit composition of the channel receptor complex (Hollmann et al. 1991). The apparent association between LTP and the conditions for associative memory (Landfield and Deadwyler 1988) has made LTP the subject of a growing number of modeling efforts (Gamble and Koch 1987; Holmes and Levy 1990; Zador et al. 1990). Given its putative role in actually triggering the induction of a synaptic change, the Ca2+ influx and the rise in the cytoplasmic Ca2+ concentration ([Ca2+]) in the dendritic spine have been a central focus of this work. Holmes and Levy (1990) have used their modeling results to argue that the simple influx of Ca2+ alone is not enough to account for associative effects. Instead they stated that associative LTP could be controlled by a steep nonlinearity in the relation between [Ca2+] and the number of coactivated synapses.
To show this relation, they used an anatomically reconstructed hippocampal dentate granule cell to build a structurally realistic model (HL model) that included NMDA and non-NMDA receptors on dendritic spines, Ca2+ diffusion, a Ca2+ buffer, and a pump. With this model they demonstrated that while the Ca2+ influx increases only moderately if a large number of synapses are coactivated, the resulting internal free [Ca2+] can increase 20- to 30-fold. To make their case, Holmes and Levy explored several of the important parameters of their model. For example, they demonstrated that the amplification result was robust to changes in Ca2+ buffer binding characteristics and buffer concentrations. However, they did not examine the dependence of their results on the Ca2+ permeability of the NMDA channel. In this paper we have explored the consequences of changing the Ca2+ permeability, that is, changing the size of the Ca2+ influx for a given NMDA current. We have reconstructed the original HL model within the GENESIS simulation environment and have replicated the previously published results. In addition, we have shown that the maximum free [Ca2+] after NMDA channel coactivation is actually quite sensitive to
the Ca2+ permeability. We have also demonstrated that there may be a considerable difference in the peak [Ca2+] expected in spines located in different regions of the dendrite. This result reinforces the idea that the induction of LTP depends not only on the properties of the NMDA channel and of the Ca2+ buffers, but also on the electrical structure of the postsynaptic neuron.
2 Implementation of the Holmes-Levy Model in GENESIS
The HL model was ported to the Caltech neuronal simulation system, GENESIS (Wilson et al. 1989). Two major changes were made to the model. The compartmental structure of the model was simplified to reduce the computational overhead, and the equations for the conductance of NMDA and non-NMDA channels were changed to a standard type (Perkel et al. 1981). We modeled 98 dendritic spines, each with 7 compartments representing a cylindrical head of 0.55 by 0.55 µm on a neck of 0.10 by 0.73 µm (Fig. 3 of HL). Each compartment in the spine contained a calcium buffer (100 µM, 200 µM at the top of the head) and a Ca2+ pump, and Ca2+ could diffuse between spine compartments and into the neighboring dendrite. NMDA and non-NMDA channels were located at the top of the spine head. In contrast to the HL model, a standard reaction scheme (Perkel et al. 1981) implemented in GENESIS was used for the NMDA and non-NMDA conductance:
A + R ⇌ AR ⇌ AR′   (2.1)
A is the transmitter, R the receptor, AR the closed transmitter-receptor complex, and AR′ the open transmitter-receptor complex. The original HL model used two variants of the reaction scheme leading to AR′. In the current model the rate factors in equation 2.1 have been optimized to give the same values of AR′ as the HL model (cf. their Fig. 4). Because the value of AR′ is the only variable directly relevant to the modeling conclusions, these differences in the reaction scheme are not pertinent. Values used for the non-NMDA channel were τg = 600 msec, τh = 1.25 msec, τd = 0.001 msec, K = 2 µM, g = 5 pS. For the NMDA channel, τg = 283 sec, τh = 5.88 msec, τd = 0.002 msec, K = 50 µM, g = 50 pS. Equations for the voltage-dependent Mg2+ block of NMDA channels and for Ca2+ diffusion, buffers, and pumps were identical to those in the HL model. The fraction of NMDA current carried by Ca2+ was based on the permeabilities of NMDA channels to Ca2+, Na+, and K+ (Goldman equations of Mayer and Westbrook 1987). Figure 1 shows that within the voltage and [Ca2+] ranges used in the model, only voltage changed this fraction. Because Ca2+ inflow was computed as a fraction of total
Figure 1: Dependence of the fraction of total NMDA current carried by Ca2+ on membrane potential for five different concentrations of internal [Ca2+] (0.02, 0.20, 2.00, 20, and 200 µM) at an external [Ca2+] of 2 mM. Because total NMDA current becomes zero around 0 mV (the reversal potential), the solution becomes asymptotic close to this potential. Note that most of the curves overlap; only at internal Ca2+ concentrations above 100 µM does internal [Ca2+] affect this fraction.
NMDA current, there was no need to compute the Ca2+ Nernst potential. The reversal potential of the NMDA current itself changed by less than 0.1 mV for a change in internal [Ca2+] from 20 nM to 2 mM (which is the external [Ca2+]). The total number of compartments in the model was 1192. The same cable parameters were used as in the HL model. The spines were randomly distributed over two 165-µm-long dendritic segments, which each contained 98 dendritic spines (inset in Fig. 4). The rest of the compartmental model was highly simplified, having only 20 compartments that represented the soma and 5 other dendrites. This simplification was possible because, under conditions where the active spines are the only source of depolarizing currents, the soma and other dendrites act only as a current drain. Accordingly, as long as the passive load corresponding to the soma and these dendrites was correct, a simplified model produces the same results for the dendritic spines as a detailed model. The input resistance at the soma was 74.4 MΩ, compared to 72.4 MΩ in the HL model.
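The computation of the Ca2+ share of the NMDA current from Goldman-Hodgkin-Katz (GHK) current equations can be sketched as follows. This is a generic GHK calculation, not the HL implementation: the baseline Ca2+ permeability, temperature, and ion concentrations below are illustrative assumptions of ours, not the paper's values.

```python
import math

F = 96485.0   # Faraday constant, C/mol
R = 8.314     # gas constant, J/(mol*K)
T = 308.0     # assumed temperature (35 C)

def ghk_current(p, z, v, c_in, c_out):
    """Goldman-Hodgkin-Katz current (arbitrary units via p) for an ion of
    valence z at membrane potential v (volts); concentrations in mol/L."""
    u = z * F * v / (R * T)
    if abs(u) < 1e-9:          # limit of the GHK expression as v -> 0
        return p * z * F * (c_in - c_out)
    return p * z * F * u * (c_in - c_out * math.exp(-u)) / (1.0 - math.exp(-u))

def ca_fraction(v_mv, p_ca_rel):
    """Fraction of total NMDA current carried by Ca2+ at v_mv (mV).
    p_ca_rel scales an assumed baseline Ca2+ permeability; the baseline
    ratio and the Na+/K+/Ca2+ concentrations are illustrative only."""
    v = v_mv * 1e-3
    i_na = ghk_current(1.0, 1, v, 10e-3, 145e-3)     # inward at rest
    i_k = ghk_current(1.0, 1, v, 140e-3, 5e-3)       # outward at rest
    i_ca = ghk_current(5.0 * p_ca_rel, 2, v, 50e-9, 2e-3)
    return i_ca / (i_ca + i_na + i_k)

# The Ca2+ share of the total current at a few hyperpolarized potentials
# (near the NMDA reversal potential the fraction is ill-defined, as the
# figure caption notes).
for v in (-90.0, -70.0, -50.0):
    print(v, round(ca_fraction(v, 1.0), 3))
```

Scaling `p_ca_rel` mimics the "relative Ca2+ permeability" manipulation studied in this paper: a larger relative permeability raises the Ca2+ share of the fixed total current.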
To quantify model results, relative Ca2+ permeability was defined as the ratio between the value used in a particular simulation and the Ca2+ permeability reported by Mayer and Westbrook (1987). A relative Ca2+ permeability of one was thus the experimental value, which corresponded to 12.8% of the NMDA current at -70 mV being carried by Ca2+ ions (Fig. 1). We have adopted the same definition of amplification ratio as introduced by HL, that is, the ratio of the maximum free [Ca2+] in a particular spine after coactivation of 96 synapses over the maximum free [Ca2+] after activation of the synapse on that single spine. The stimulus paradigm for induction of LTP was 8 pulses at 200 Hz, as in HL. The GENESIS implementation of the HL model described in this paper can be obtained by ftp from babel.cns.caltech.edu.
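The two definitions above reduce to simple ratios; a minimal helper makes them explicit (function names and the numbers in the usage lines are ours, purely illustrative, not simulation output).

```python
def relative_permeability(p_ca_used, p_ca_experimental):
    """Relative Ca2+ permeability: the value used in a simulation divided
    by the Mayer-Westbrook (1987) experimental value, so 1.0 means the
    experimental permeability."""
    return p_ca_used / p_ca_experimental

def amplification_ratio(peak_free_ca_96, peak_free_ca_1):
    """HL amplification ratio: peak free [Ca2+] in a spine when 96
    synapses are coactivated over the peak when only that spine's own
    synapse is activated."""
    return peak_free_ca_96 / peak_free_ca_1

# Illustrative numbers in arbitrary units:
print(relative_permeability(6.4, 12.8))   # half the experimental value
print(amplification_ratio(30.0, 1.2))
```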
3 Results
In general, our implementation of the HL model within the GENESIS simulation software gave qualitatively equivalent results to those reported by Holmes and Levy. The change in membrane potential and NMDA channel conductance in two dendritic spines during coactivation of 96 spines is compared in Figure 2 with the original data from HL (their Fig. 7). The small differences in peak values were probably caused by sampling, because the size of the responses to synaptic activation was different in every spine. Because we implemented the same Ca2+ mechanisms as employed in the HL model, our model also reproduced their computations of [Ca2+] exactly (results not shown). The main new modeling results are presented in Figure 3. Figure 3A shows the sharp dependency of the amplification ratio on the relative Ca2+ permeability of the NMDA channel. For 96 coactivated spines versus 1 spine, under the standard conditions of the HL model, the amplification curve peaked at a relative permeability of about 1.25. At lower permeabilities the amplification ratio declined steeply and dropped below 5 at a relative permeability of 0.50. At higher permeabilities it slowly declined to an amplification ratio of about 10. This dependency was similar for all dendritic spines, independent of their location; there was only a difference in amplitude. The relation between maximum free [Ca2+] and relative Ca2+ permeability was shallow and nonlinear for low permeabilities, and steeper and linear for higher permeabilities (Fig. 3B). A peak appeared in the amplification-ratio-to-permeability curve because the linear part started at lower permeability values for activation of 96 spines than for 1 spine. We also examined the effect of changing important parameters in the HL model on the amplification-ratio-to-permeability curve. Changes in the buffer concentration in the spine head changed the location and size of the peak, but not the general shape of the curve (Fig. 3C). For lower
Figure 2: Spine head membrane potential and NMDA receptor-mediated synaptic conductance for two different spines during coactivation of 96 spine synapses at 50 and 200 Hz. A and B are the original figures of Holmes and Levy (1990, courtesy of The American Physiological Society) and C and D are the corresponding figures produced by the implementation of their model described in this report. In C and D the responses in a distal spine (upper lines) and a proximal spine are shown. The responses in the distal spine are always bigger than in the proximal one. (A, C) Membrane potential as a function of time. (B, D) NMDA channel conductance at the synapse on the same spine heads. The proximal spine is spine #1, the distal spine is spine #6 (see Fig. 4).
buffer concentrations the peak was smaller and occurred at smaller relative Ca2+ permeabilities. The reverse was true for higher buffer concentrations. Changes of the rate constant of the calcium pump by a factor of 2 did not change the amplification-ratio-versus-permeability curve and changed the maximum free [Ca2+] by less than 1% (results not shown). We found the amplification-ratio-to-permeability curve to be quite sensitive to the amount of transmitter released presynaptically (A in equation 2.1), which would affect both the NMDA- and non-NMDA-mediated components of the postsynaptic response (Fig. 3D). Doubling the amount of transmitter released per stimulus sharpened the peak considerably and shifted it to lower relative Ca2+ permeability values (peaking at about 0.5).
Figure 3: Amplification ratio after a 200 Hz stimulus as a function of relative Ca2+ permeability under different model conditions. (A) Standard HL model: amplification at two different spines, located distally (upper line, spine 6 in Fig. 4) and proximally (spine 1 in Fig. 4) on the dendrite, is compared. (B) Maximum [Ca2+] as a function of relative Ca2+ permeability after activation of 1 or 96 spines, in the proximal spine. (C) Effect of changing the buffer concentration [B] in the spine head on the amplification ratio. (D) Effect of changing the amount of transmitter released per stimulus (A) on the amplification ratio.
Halving the transmitter release flattened the curve so that no clear peak could be distinguished; it also greatly diminished the amplification ratio at all levels of Ca2+ permeability. As our results suggested a big variation in amplification ratios between different spines, we compared the time course of [Ca2+] for spines located on different parts of the dendritic tree (Fig. 4). Because of the passive electrical properties of dendrites, distal regions of the cell were likely to be more depolarized than proximal regions for the same amount of input (Fig. 2C). This was a consequence of the passive load of the soma and other dendrites on proximal regions. This in turn means that NMDA channels on distal spines were less blocked by Mg2+ than channels in proximal spines (Fig. 2D). As a result, the Ca2+ concentration reached peaks in distal spines that were 20 to 80% higher than in proximal spines. Further, because there was very little difference in maximum free [Ca2+] after activation of a single synapse (2 to 3%, depending on the relative
Figure 4: Calcium concentration as a function of time in six spines at different dendritic locations during coactivation of 96 spines. The location of the spines is shown on the gray schematic at the upper right; the six spines are shown in black.
Ca2+ permeability), differences in amplification ratio also varied between 20 and 80% (Fig. 2). This effect was most pronounced for relative Ca2+ permeabilities of 0.5 to 1.5, which straddle reported experimental values. Within spine heads, there was almost no gradient of free Ca2+ or free buffer. For example, 18 msec after the last stimulus at 200 Hz, when free [Ca2+] reached its peak value, [Ca2+] was 30.20 µM under the membrane and 29.37 µM at the base of the spine head. There was, however, a big gradient over the spine neck, as [Ca2+] in the underlying dendritic shaft was only 0.09 µM.
4 Discussion
In this paper we have extended the examination of the parameter space in a previously published biophysical model of associative LTP (Holmes and Levy 1990). While this is the most detailed model of [Ca2+] changes during activation of NMDA receptors on dendritic spines published to date, other models of LTP-related changes in [Ca2+] have been reported. For example, an older model by Gamble and Koch (1987) predicted changes in internal calcium concentrations during LTP. However, this model did not explicitly make use of NMDA channels. A second, more recent model by Zador et al. (1990) has essentially the same components as the HL model, but simulates only one spine on a compartmental model of a pyramidal cell and uses a fixed Ca2+ permeability that is independent of voltage (compare to Fig. 1). We expect that the changes in relative Ca2+ permeability described here would have a similar effect in this model. Other models of LTP (Kitajima and Hara 1990) or NMDA channels (Jahr and Stevens 1990) were not constructed to simulate realistic changes in [Ca2+]. The principal results described here are the apparent sensitivity of the Holmes and Levy model to the Ca2+ permeability of the NMDA channel and to the dendritic position of the activated spines. These properties were a direct consequence of the [Ca2+] amplification mechanism on which the HL model is based, that is, buffer saturation in the spine head. Optimal amplification happened when most of the inflowing Ca2+ was bound to buffers after a single spine was activated, while coactivation of many spines saturated the buffers completely and consequently caused a large rise in free [Ca2+]. The buffers in the spine head could saturate because diffusion out of the spine was restricted, as shown by the large drop in [Ca2+] over the spine neck. The HL model fits within present theories of the physiological function of dendritic spines, which emphasize the compartmentalization of chemical processes in the spine head (Koch et al. 1992). It has been shown in large cells that diffusion of Ca2+ buffers may have a profound effect on dynamic changes in [Ca2+] (Sala and Hernández-Cruz 1990). The HL model does not simulate diffusion of buffers, but based on our results we do not think that such diffusion plays any role in this system.
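The buffer-saturation mechanism can be illustrated with a toy equilibrium calculation: a fixed amount of buffer mops up a small Ca2+ load almost completely, while a load exceeding the buffer capacity leaves a disproportionately large free remainder. The single-buffer equilibrium treatment and all parameter values here are illustrative assumptions of ours, not the HL model's kinetics.

```python
import math

def free_ca(total_ca, b_total=100.0, kd=1.0):
    """Equilibrium free [Ca2+] (uM) when total_ca (uM) is partitioned
    between free Ca2+ and a single buffer of concentration b_total with
    dissociation constant kd (all uM; illustrative values).
    From free + b_total*free/(kd + free) = total_ca, a quadratic in free."""
    b = b_total + kd - total_ca
    return 0.5 * (-b + math.sqrt(b * b + 4.0 * kd * total_ca))

# A 6.5-fold increase in total Ca2+ load ("one spine" vs. "many spines")
# produces a far more than 6.5-fold increase in free [Ca2+], because the
# larger load exceeds the buffer capacity.
one_spine = free_ca(20.0)    # well below capacity: almost all bound
many = free_ca(130.0)        # beyond capacity: buffer saturates
print(round(one_spine, 3), round(many, 2), round(many / one_spine))
```

With these numbers the free concentration rises supralinearly, which is the qualitative behavior behind the 20- to 30-fold amplification reported by Holmes and Levy; reducing the load (lower Ca2+ permeability) keeps even the coactivation case in the buffered regime and the amplification collapses.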
Because of the small volume of the spine head, there was almost no gradient of free or bound Ca2+ buffers. There was a big gradient over the spine neck, but presumably buffers would diffuse much more slowly through this restricted space than Ca2+ itself. The interaction between buffer saturation and the relative Ca2+ permeability of the NMDA channel produced several interesting results, some of which may be counterintuitive. For example, increasing the amplitude of the synaptic conductance actually decreased the ability of the system to distinguish between activation of a few or many spines. This was evidenced by the small drop in amplification ratio at a relative permeability of 1.0 (Fig. 3D). If the synaptic conductance was increased even more, the decrease would have become more pronounced. Decreasing the synaptic conductance was worse, because the amplification ratio dropped below 5. These results are important because both short- and long-term potentiation and depression of synaptic conductance have been described at hippocampal synapses (Larkman et al. 1991; Malenka 1991). Surprisingly, changing the amount of buffer in the spine head had much less effect on the amplification ratio at a relative Ca2+ permeability
of 1.0 (Fig. 3C), as was also pointed out by Holmes and Levy. At higher relative Ca2+ permeabilities, higher buffer concentrations increased the sensitivity of the system. Note that decreasing the buffer concentration could not fully compensate for large decreases in Ca2+ permeability. It has not been proven that a nonlinear amplification of [Ca2+] is the critical feature in associative LTP. For example, if the next step in the induction of LTP (e.g., activation of a Ca2+-dependent kinase; Miller and Kennedy 1986) has a sharp, nonlinear dependence on [Ca2+], then such a mechanism might be robust enough to operate with smaller changes in [Ca2+] (Zador et al. 1990). However, recent imaging experiments do show increases in [Ca2+] from a resting level of 0.05 to 1.30 µM in dendritic spines under conditions that are expected to induce LTP (Müller and Connor 1991). Holmes and Levy argue that the nonlinearity underlying the induction of associative LTP should be as steep as possible, and they eliminate Ca2+ influx itself as a potential inductor because it is amplified by a factor of only 3. Combining this argument with the experimental data, it seems reasonable to assume that a safe amplification factor for the induction of associative LTP should be at least 10. Thus, we have shown that diminishing the Ca2+ permeability by 50% makes the amplification ratio too small to function as a reliable inductor of associative LTP. The same is true for a decrease in the synaptic conductance. Increasing the Ca2+ permeability also changed the amplification ratio, but it never dropped below 10. Holmes and Levy did not report the effects of changing these critical model parameters on the predictions made by their model. We have also shown that the location of a particular dendritic spine with respect to the electrical structure of the entire cell may have a profound effect on its participation in LTP.
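The location dependence just described arises because distal spines are more depolarized and hence less blocked by Mg2+. The voltage dependence of the block can be sketched with the widely used empirical fit of Jahr and Stevens (1990); the HL model uses its own block equations, so this is only an illustrative stand-in, and the function name and default Mg2+ concentration are ours.

```python
import math

def mg_unblock_fraction(v_mv, mg_mM=1.0):
    """Fraction of NMDA channels not blocked by Mg2+ at membrane
    potential v_mv (mV), using the Jahr-Stevens (1990) empirical fit.
    Illustrative stand-in for the Holmes-Levy block equations."""
    return 1.0 / (1.0 + (mg_mM / 3.57) * math.exp(-0.062 * v_mv))

# Depolarization relieves the block: near rest most channels are
# blocked, while a more depolarized (e.g., distal) spine passes more
# NMDA current, and hence more Ca2+.
for v in (-70, -40, 0):
    print(v, round(mg_unblock_fraction(v), 3))
```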
Our results suggest that LTP may be a cooperative phenomenon that, besides the nonlinear interaction between NMDA channels, also involves the structure of the entire postsynaptic neuron. The interaction of Ca2+ effects with the passive electrical properties of the cell's dendrite could result in changes of the amplification ratio of up to 80%, depending on the particular position of a spine. Whether the difference between a peak [Ca2+] of 18.5 versus 29.8 µM (Fig. 4) would also cause quantitative differences in the amount of LTP induced is unknown at present. This can be determined only after more experimental data have become available, so that biochemical models of the processes involved in LTP induction can be developed. The current simulations have examined the somewhat unlikely occurrence of activation in only two dendritic segments. Similar effects could be produced if network circuitry results in differential activation and/or inhibition of different regions of a particular dendrite. In this regard, our network simulations of the olfactory piriform cortex (Wilson and Bower 1992) and neocortex (Wilson and Bower 1991) make it seem quite likely that the laminar organization of both the cortex and the hippocampus (Brown and Zador 1990) could easily produce such differential effects
on pyramidal cells. In this context, Mel (1992) reported that a modeled cortical pyramidal neuron with NMDA channels responds preferentially to clustered synaptic inputs versus distributed ones. Location-dependent differences in the magnitude of LTP have been reported in the piriform cortex by Kanter and Haberly (1990). They reported, however, an inverse relation to what the model predicts; that is, LTP induced by association fibers on the proximal parts of the dendrite was larger than that induced by the more distally located afferent fibers. This discrepancy can be explained by several factors, among them specific differences in the NMDA receptors themselves (see further) and the effect of somatic action potentials, which would depolarize proximal NMDA channels more than distal ones (and thus remove the effect of the voltage-dependent Mg2+ block). There are several possible consequences of such a location dependence. It is conceivable, for example, that variations in amplification effects in dendritic regions could reflect functional differences in projecting fiber systems. It may be that the operation of a particular neuron would depend on excluding synapses in certain positions from participating in LTP, even in the presence of NMDA receptors. For example, there are several cases known where NMDA receptors are present but LTP has not been demonstrated (Artola and Singer 1987). In these cases the electrical properties of some neurons may not support the amplification effects shown in the HL model. As pointed out above, in other cases spread of somatic action potentials into the proximal parts of the dendritic tree might counteract the location dependence. While it is interesting to speculate on the possible effects of cell structure and changes in presynaptic transmitter release on the induction of associative LTP, there are ways in which the effects we have described could be overcome.
For example, position-dependent changes in Ca2+ conductivity could counteract the effects shown here. This could be achieved by changing the ratio of NMDA versus non-NMDA receptors or by changing the Ca2+ permeability of the NMDA channel. In this regard, recent reports on the cation permeability of reconstituted non-NMDA channels show that permeability can vary with the subunit composition of the channel complex (Hollmann et al. 1991). It has also been shown that the expression of non-NMDA channel subunits that make the channel permeable to Ca2+ can be tissue and cell specific (Burnashev et al. 1992). Though subunit-specific variability in cation permeability has not been shown for the NMDA channel, this suggests a molecular mechanism for creating localized differences in Ca2+ permeability. Bekkers and Stevens (1990) report significantly lower Ca2+ permeabilities for NMDA channels in hippocampal neurons, compared to the values determined by Mayer and Westbrook (1987) in mouse spinal cord neurons (respectively, 4.5 versus 12.8% of the NMDA current being carried by Ca2+ at 2 mM external [Ca2+]). It is also conceivable that the Ca2+ permeability of the NMDA channel might be affected by phosphorylation of the channel proteins. Similar changes in the degree of phosphorylation have been implicated
in numerous molecular mechanisms presumed to be involved in synaptic function (Huganir and Greengard 1990), and protein kinase C has been shown to potentiate NMDA current by reducing the voltage-dependent Mg2+ block (Ben-Ari et al. 1992). Changing specifically the Ca2+ permeability of NMDA channels, while keeping their density and total conductance unchanged, would have the advantage that the induction of LTP could be controlled without changing the electrical properties of the neuron. Finally, whatever the significance of differences in dendritic location, our modeling results draw attention to the critical question of the actual permeabilities of NMDA channels to Ca2+. Mayer and Westbrook (1987) and Ascher and Nowak (1988) have pointed out that the Goldman (1943) equations (used in the HL model) cannot account for the full properties of the NMDA channel. It is interesting to note that the experimental values for Ca2+ permeability reported by Mayer and Westbrook (1987) are within 25% of the values that cause a maximum amplification of free [Ca2+] in the spine head in the HL model. Assuming that other parameters of the model are accurate, this suggests that the dendritic spine apparatus and its control over [Ca2+] may operate close to maximal efficiency for sensing coactivation of synapses. However, this would not be true for the Ca2+ permeabilities reported by Bekkers and Stevens (1990), which at a relative permeability of 0.35 correspond to an amplification of only 4 to 5. In light of the potentially profound effect of NMDA receptor permeabilities on LTP, it appears important to make additional measurements of this value in other brain regions showing associative LTP.
Acknowledgments
This work was supported by Fogarty fellowship F05 TW04368 to EDS and a grant from the Office of Naval Research, Contract N00014-91-51831. We thank the editors for useful comments on a first draft of this paper.
References
Artola, A., and Singer, W. 1987. Long-term potentiation and NMDA receptors in rat visual cortex. Nature (London) 330, 84436.
Ascher, P., and Nowak, L. 1988. The role of divalent cations in the N-methyl-D-aspartate responses of mouse central neurones in culture. J. Physiol. 399, 247-266.
Bekkers, J. M., and Stevens, C. F. 1990. Computational implications of NMDA receptor channels. Cold Spring Harbor Symp. Quant. Biol. 55, 131-135.
Sensitivity to Ca2+Permeability of NMDA Channels
693
Ben-Ari, Y., Aniksztejn, L., and Bregestovski, P. 1992. Protein kinase C modulation of NMDA currents: An important link for LTP induction. Trends Neurosci. 15, 333-339.
Brown, T. H., and Zador, A. M. 1990. Hippocampus. In The Synaptic Organization of the Brain, G. M. Shepherd, ed., pp. 346-388. Oxford University Press, New York.
Burnashev, N., Khodorova, A., Jonas, P., Helm, P. J., Wisden, W., Monyer, H., Seeburg, P. H., and Sakmann, B. 1992. Calcium-permeable AMPA-kainate receptors in fusiform cerebellar glial cells. Science 256, 1566-1570.
Byrne, J. H., and Berry, W. O., eds. 1989. Neural Networks of Plasticity: Experimental and Theoretical Approaches. Academic Press, San Diego.
Gamble, E., and Koch, C. 1987. The dynamics of free calcium in dendritic spines in response to repetitive synaptic input. Science 236, 1311-1315.
Goldman, D. E. 1943. Potential, impedance, and rectification in membranes. J. Gen. Physiol. 27, 37-60.
Hebb, D. O. 1949. The Organization of Behavior: A Neuropsychological Theory. John Wiley, New York.
Hollmann, M., Hartley, M., and Heinemann, S. 1991. Ca2+ permeability of KA-AMPA-gated glutamate receptor channels depends on subunit composition. Science 252, 851-853.
Holmes, W. R., and Levy, W. B. 1990. Insights into associative long-term potentiation from computational models of NMDA receptor-mediated calcium influx and intracellular calcium concentration changes. J. Neurophysiol. 63, 1148-1168.
Huganir, R. L., and Greengard, P. 1990. Regulation of neurotransmitter receptor desensitization by protein phosphorylation. Neuron 5, 555-567.
Jahr, C. E., and Stevens, C. F. 1990. Voltage dependence of NMDA-activated macroscopic conductances predicted by single-channel kinetics. J. Neurosci. 10, 3178-3182.
Kanter, E. D., and Haberly, L. B. 1990. NMDA-dependent induction of long-term potentiation in afferent and association fiber systems of piriform cortex in vitro. Brain Res. 525, 175-179.
Kitajima, T., and Hara, K. 1990.
A model of the mechanisms of long-term potentiation in the hippocampus. Biol. Cybern. 64, 33-39.
Koch, C., Zador, A., and Brown, T. H. 1992. Dendritic spines: Convergence of theory and experiment. Science 256, 973-974.
Landfield, P. W., and Deadwyler, S. A., eds. 1988. Long-term Potentiation: From Biophysics to Behavior. Alan Liss, New York.
Larkman, A., Stratford, K., and Jack, J. 1991. Quantal analysis of excitatory synaptic action and depression in hippocampal slices. Nature (London) 350, 344-347.
Malenka, R. C. 1991. Postsynaptic factors control the duration of synaptic enhancement in area CA1 of the hippocampus. Neuron 6, 53-60.
Mayer, M. L., and Westbrook, G. L. 1987. Permeation and block of N-methyl-D-aspartic acid receptor channels by divalent cations in mouse cultured central neurones. J. Physiol. (London) 394, 501-527.
Mayer, M. L., Westbrook, G. L., and Guthrie, P. B. 1984. Voltage-dependent block
by Mg2+ of NMDA responses in spinal cord neurones. Nature (London) 309, 261-263.
Mel, B. 1992. NMDA-based pattern discrimination in a modeled cortical neuron. Neural Comp. 4, 502-517.
Miller, S. G., and Kennedy, M. B. 1986. Regulation of brain type II Ca2+/calmodulin-dependent protein kinase by autophosphorylation: A Ca2+-triggered molecular switch. Cell 44, 861-870.
Müller, W., and Connor, J. A. 1991. Dendritic spines as individual neuronal compartments for synaptic Ca2+ responses. Nature (London) 354, 73-76.
Nicoll, R. A., Kauer, J. A., and Malenka, R. C. 1988. The current excitement in long-term potentiation. Neuron 1, 97-103.
Nowak, L., Bregestovski, P., Ascher, P., Herbert, A., and Prochiantz, A. 1984. Magnesium gates glutamate-activated channels in mouse central neurones. Nature (London) 307, 462-465.
Perkel, D. H., Mulloney, B., and Budelli, R. W. 1981. Quantitative methods for predicting neuronal behavior. Neuroscience 4, 823-837.
Regehr, W. G., and Tank, D. W. 1990. Postsynaptic NMDA receptor-mediated calcium accumulation in hippocampal CA1 pyramidal cell dendrites. Nature (London) 345, 807-810.
Sala, F., and Hernández-Cruz, A. 1990. Calcium diffusion modeling in a spherical neuron: Relevance of buffering properties. Biophys. J. 57, 313-324.
Wilson, M. A., and Bower, J. M. 1991. A computer simulation of oscillatory behavior in primary visual cortex. Neural Comp. 3, 498-509.
Wilson, M. A., and Bower, J. M. 1992. Cortical oscillations and temporal interactions in a computer simulation of piriform cortex. J. Neurophysiol. 67, 981-995.
Wilson, M. A., Bhalla, U. S., Uhley, J. D., and Bower, J. M. 1989. GENESIS: A system for simulating neural networks. In Advances in Neural Information Processing Systems, D. Touretzky, ed., pp. 485-492. Morgan Kaufmann, San Mateo, CA.
Zador, A., Koch, C., and Brown, T. H. 1990. Biophysical model of a Hebbian synapse. Proc. Natl. Acad. Sci. U.S.A. 87, 6718-6722.
Received 20 March 1992; accepted 25 January 1993.
This article has been cited by:
Communicated by Sidney Lehky
Models of Perceptual Learning in Vernier Hyperacuity

Yair Weiss
Interdisciplinary Program, Tel Aviv University, Tel Aviv 69978, Israel

Shimon Edelman
Department of Applied Mathematics and Computer Science, The Weizmann Institute of Science, Rehovot 76100, Israel

Manfred Fahle
Department of Neuroophthalmology, University Eye Clinic, Schleichstr. 12, 7400 Tübingen, Germany

Performance of human subjects in a wide variety of early visual processing tasks improves with practice. HyperBF networks (Poggio and Girosi 1990) constitute a mathematically well-founded framework for understanding such improvement in performance, or perceptual learning, in the class of tasks known as visual hyperacuity. The present article concentrates on two issues raised by the recent psychophysical and computational findings reported in Poggio et al. (1992b) and Fahle and Edelman (1992). First, we develop a biologically plausible extension of the HyperBF model that takes into account basic features of the functional architecture of early vision. Second, we explore various learning modes that can coexist within the HyperBF framework and focus on two unsupervised learning rules that may be involved in hyperacuity learning. Finally, we report results of psychophysical experiments that are consistent with the hypothesis that activity-dependent presynaptic amplification may be involved in perceptual learning in hyperacuity.

1 Introduction
The term "perceptual learning" refers to the significant improvement, precipitated by practice, in the performance of human subjects in various perceptual tasks (Walk 1978). Some of the more intriguing aspects of perceptual learning, such as its specificity for particular stimulus parameters, and the associated lack of performance transfer to new parameter values (Fiorentini and Berardi 1981; Karni and Sagi 1991), remained until not long ago without an adequate computational explanation. In the present report, we show that a recently proposed mathematical framework for learning from examples, known as HyperBF approximation (Poggio and Girosi 1990), yields a biologically plausible and flexible model of perceptual learning in early vision. Following the work described in Poggio et al. (1992b), we concentrate on the example of learning vernier hyperacuity.

Neural Computation 5, 695-718 (1993) © 1993 Massachusetts Institute of Technology

1.1 HyperBF Networks. Within the HyperBF framework, the problems of detection and discrimination of visual stimuli are approached in terms of the computation of multivariate functions defined over the input space. In particular, learning to solve these problems is considered equivalent to approximating the value of an appropriate function at any point in the input space, given its values at other points that belong to a set of examples. In a standard implementation, the task of approximating a function is divided into two stages (Poggio and Girosi 1990): an initial (usually nonlinear) transformation, in which the input is mapped into a set of basis functions, and a linear stage, in which the output function is computed as a linear combination of the basis functions. More precisely, the function f(x) is approximated as f(x) = c · h(x), where h(x) is a vector of the values of the (nonlinear) basis functions, and c is a vector of weights. It is possible to divide the initial transformation into two substages: a transduction or dimensionality reduction stage, in which the input is mapped into a real vector space V, and a basis function computation stage, in which the value of each component of h is determined by a function hi : V → R. If the basis functions are radial, then hi(x) = hi(||x − x0||), where x0 are called the centers of the chosen set of basis functions. In a distributed implementation of this scheme by a multilayer network (Poggio and Girosi 1990), h represents the response of units in an intermediate layer to the stimulus, and c represents the weights of the synapses between the intermediate layer and the output unit.
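The two-stage computation described above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: the Gaussian form of the basis functions, the two-dimensional input space, and all numerical values are assumptions made for the example.

```python
import numpy as np

def gaussian_basis(x, centers, width):
    """Radial basis functions h_i(x) = exp(-||x - x_i||^2 / width^2)."""
    d = np.linalg.norm(x - centers, axis=1)
    return np.exp(-(d / width) ** 2)

def hyperbf_output(x, centers, width, c):
    """Linear stage: f(x) = c . h(x), a weighted sum of basis activities."""
    return np.dot(c, gaussian_basis(x, centers, width))

# Three illustrative centers in a 2-D input space, with weights c
centers = np.array([[-1.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
c = np.array([-1.0, 0.0, 1.0])
y = hyperbf_output(np.array([0.9, 0.1]), centers, width=1.0, c=c)
```

An input near the center with weight +1 yields a positive output, and one near the center with weight −1 a negative output, which is all that a sign-based decision stage (section 1.2) requires.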
1.2 Modeling 2AFC Experiments with HyperBF Networks. Because the output of the HyperBF module is a continuous function of the input, an additional mechanism is needed to model the decision stage in a two-alternative forced choice (2AFC) experiment. For that purpose, the full model (see Fig. 1) includes a threshold mechanism that outputs +1 or −1, depending on the sign of its input. Such a threshold unit is likely to be affected by noise, whose source can be, for example, the spontaneous activity of nearby units [we call this "decision noise," as opposed to "early noise" that is already present in the values of the basis functions h(x)]. Thus, the output of the threshold unit can be described by R(x) = sign(c · h + DN), where DN is a zero-mean normal random variable with a standard deviation σ_N. Given the distribution of the noise in the system, it is possible to calculate the performance of the model in a 2AFC experiment. For example, if Y is the response of the HyperBF module to a certain right-offset vernier,
Figure 1: A model of a two-alternative forced choice (2AFC) experiment. The output of the HyperBF module (left) is thresholded. Because the thresholding unit is also affected by the spontaneous activity of other units, only stimuli that elicit a strong response from the HyperBF network will be detected correctly with a high probability.

and if early noise is neglected, the probability of a correct response is

P_correct = Φ(Y/σ_N)    (1.1)

where Φ is the cumulative distribution function of the standard normal. The offset threshold of the model is then defined as the smallest offset, OT, for which the probability of correct responses as defined above exceeds 0.75. If the network's output depends on the input offset in an almost linear fashion (see Fig. 4), then the psychometric curve and the threshold can be predicted analytically. Substituting Y(o) = ao in equation 1.1 gives

P_correct(o) = Φ(ao/σ_N)    (1.2)
The percentage of correct responses plotted against the input offset will then be a sigmoid, and the threshold will be inversely proportional to a. Although the above approach makes it possible to model both interpolation and decision performance, we chose to concentrate on the former. This choice was motivated by the assumption that the interesting aspects of learning have to do with changes in the performance of the interpolation module. In the modeling of vernier acuity, this assumption
is supported by the psychophysical findings of stimulus specificity of learning, which cannot be accounted for by decision-level changes alone (Poggio et al. 1992b).
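The psychometric prediction of section 1.2 is easy to compute numerically. The sketch below is illustrative rather than taken from the paper: the `erf`-based normal CDF and the linear scan for the 0.75-correct offset are assumptions of the example.

```python
from math import erf, sqrt

def p_correct(offset, a, sigma):
    """P(correct) = Phi(a * offset / sigma) for a linear response
    Y(o) = a * o and zero-mean gaussian decision noise of s.d. sigma."""
    return 0.5 * (1.0 + erf(a * offset / (sigma * sqrt(2.0))))

def offset_threshold(a, sigma, criterion=0.75, step=0.01):
    """Smallest offset whose predicted proportion correct exceeds the criterion."""
    o = 0.0
    while p_correct(o, a, sigma) < criterion:
        o += step
    return o
```

Because Φ is fixed, the threshold scales as σ_N/a: doubling the slope a of the network's response halves the predicted threshold, which is the sense in which the threshold is "inversely proportional to a."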
1.3 Vernier Hyperacuity. A vernier target consists of two line segments, separated by a small offset in the direction orthogonal to the orientation of the lines. The subject's task in a vernier acuity experiment is to judge whether the misalignment is to the left or to the right. Humans solve this task successfully when the offset is as small as 5″ or less, exhibiting a discrimination threshold that is far lower than that for spatial frequency grating discrimination or for two-point resolution, and is smaller than the spacing of the cones in the fovea. Moreover, this astonishing precision of vernier discrimination, termed hyperacuity, is maintained even when the target is in motion (Westheimer and McKee 1975). It should be noted that hyperacuity performance does not contradict any physical law, since the optics of the eye at the fovea satisfy the constraints of the sampling theorem, making the spatial information recoverable in principle, by appropriate filtering (Barlow 1979; Crick et al. 1981). Our choice of vernier hyperacuity as a paradigmatic case of perceptual learning was motivated by two considerations. First, the improvement in the vernier threshold has both fast and slow components (McKee and Westheimer 1978; Fendick and Westheimer 1983; Fahle and Edelman 1992), signifying that a number of distinct learning mechanisms may be at work. Second, as we have already mentioned, perceptual learning in vernier hyperacuity is specific for stimulus orientation (Fahle and Edelman 1992), a possible indication that performance in this task is based, at least in part, on interpolation among input examples acquired during learning. The next two sections describe in detail a computational model of hyperacuity performance based on HyperBF interpolation.

2 Modeling Hyperacuity-Level Performance

2.1 Simulated Experiments.
In the simulated psychophysical experiments described below we compared two versions of the HyperBF scheme, one without and the other with a preprocessing or transduction stage. Unlike in Poggio et al. (1992a), where a transduction stage was used, in version A of the present model the basis function vector h represented the activities of three orientation-selective units, with response peaks at −15°, 0°, and +15° with respect to the vertical. The response of each unit was calculated by convolving the stimulus retinal image I(x, y) with the appropriate receptive field function RF(x, y) and adding noise EN (the equation was garbled in the source; a reconstruction consistent with this description is h_i = ∫∫ RF_i(x, y) I(x, y) dx dy + EN).
Figure 2: A simple network for vernier discrimination, obtained by combining responses of orientationally selective units. If the parameters of the oriented units are set according to the data from psychophysical masking experiments (Wilson and Gelb 1984), this network can solve the vernier task at a much smaller threshold than the size of the excitatory region of each unit. This network is suitable for solving the vernier task only when the stimulus is vertically oriented. Networks that combine responses of a range of orientationally selective units can be used for stimuli of different orientations.

For the 0° unit, RF(x, y) = e^(−y²/σ_y²) (e^(−x²/σ_1²) − B e^(−x²/σ_2²) + C e^(−x²/σ_3²)). Equations for the ±15° units were obtained using standard rotation of coordinates. All constants were taken from Wilson and Gelb (1984) as those representing the smallest spatial frequency channel in human subjects. These constants are based on masking experiments, and are consistent with data from single-cell recordings in macaque striate cortex (Wilson 1986; Wilson and Gelb 1984). EN was a zero-mean gaussian random variable. The responses of these three basis functions to a sequence of vernier stimuli (a "vernier tuning curve") are shown in Figure 3, along with a more conventional orientation tuning curve for the vertically oriented unit. In version B, the orientation-selective units described above were considered as transducers that mapped the activity pattern of the retinal array into R³. The network solved the problem by carrying out radial basis function interpolation in R³. The basis functions comprising h were three gaussians in R³, centered around the transduced representations of three vernier stimuli with offsets of −30, 0, and 30 arcsec, respectively. The widths of the gaussians were set to the average distances (in R³) between their centers. In both versions the weight vector c was obtained by solving the equation

Hc = Y    (2.1)
[Figure 3 plot: basis-unit responses vs. offset (arcsec), −30 to 30; panel title: "Response characteristics of basis units (version A)"]
Figure 3: "Single-cell recordings" from the basis units in the network, for a range of vernier offset values. The three curves marked by filled symbols are the responses to vernier stimuli of the three orientation-selective units shown in Figure 2. The error bar shows the standard deviation of the noise used to obtain the response curve shown in Figure 4. For comparison, the response curve of the vertically oriented unit to an unbroken line passing through the midpoints of the two lines comprising the vernier target is also shown (the top curve, marked by circles). The network treats such an oriented line as equivalent to a vernier target. Single-cell recordings from area 17 of a cat's visual cortex (Swindale and Cynader 1986) have revealed similar response patterns.
where each row in the matrix H represents the activity of the hidden layer in response to a vernier stimulus, and Yi is set to −1 for left offsets and to +1 for right offsets.

2.2 Results. We first explored the response of the network to vernier stimuli consisting of two lines 8′ long. In both versions, the weight vector c was of the form α(1, 0, −1)^T, and the network could be described simply as an output unit with an excitatory synapse with the unit tuned to the left-slanting lines and an inhibitory synapse with the unit tuned to the right-slanting lines (see Fig. 2). The zero weight of the vertically oriented unit is due to the fact, pointed out by Wilson, that this unit, despite being the most active one, carries the least amount of information relevant to the solution of the task. The output of this network for offsets ranging from −30″ to 30″ is shown in Figure 4a. These graphs represent the response of the network
to a vernier stimulus centered over the orientation detectors. In practice, however, random eye movements of the subject prevent such precise centering of stimuli during psychophysical experiments. Figure 4b shows the network's responses to stimuli displaced by random displacements of up to 20″ in the vertical and horizontal directions.¹ It can be seen that the network's output is little affected by stimulus location. Interestingly, while the network is sensitive enough to signal a vernier displacement of only 1″, it practically ignores a much larger translation of the entire stimulus. Note that in both cases the network's output depends on the input offset in an almost linear fashion, while the function to be approximated, f(x), is a step function. Such poor approximation of the target function is understandable, considering the smoothness of the basis functions. A much better approximation could be reached using discontinuous basis functions, or, alternatively, using a large number of narrow gaussian basis functions.

Figure 4: (a) Response of the network plotted vs. the vernier offset of a stimulus presented at a fixed location with respect to the receptive fields. The two versions, A and B (top and bottom rows), yielded similar results here. (b) Response of the network vs. the vernier offset of a stimulus presented at random locations within a 20″ horizontal range around the common center of the receptive fields.

¹The distance between units that were spatial nearest neighbors was used by Wilson and Gelb (1984) as a free parameter to fit spatial frequency discrimination data. For the filters we used in our model, this distance corresponded to 38.3″. Thus, a stimulus appearing at a random retinal location would always be within 20″ of a set of filters.
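The weight computation of equation 2.1 is a least-squares problem that can be sketched as follows. The dimensions (six training stimuli, three basis units) and the random activity matrix are illustrative assumptions, not the paper's training set.

```python
import numpy as np

# Each row of H: hidden-layer activities for one training vernier stimulus;
# Y: desired outputs, -1 for left offsets and +1 for right offsets.
rng = np.random.default_rng(0)
H = rng.normal(size=(6, 3))                      # toy activity matrix
Y = np.array([-1, -1, -1, 1, 1, 1], dtype=float)

# Least-squares solution of H c = Y via the Moore-Penrose pseudoinverse
c = np.linalg.pinv(H) @ Y
```

When H has more rows than columns, `pinv` returns the weight vector minimizing ||Hc − Y||², so the residual Hc − Y is orthogonal to the columns of H.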
Figure 5: Dependence of vernier threshold on line length (a) and separation between lines (b). Threshold was estimated by making assumptions regarding the statistical distributions of the noise sources. These distributions were held fixed as stimulus parameters varied. Compare to Figure 6.
Figure 6: Dependence of vernier threshold on line length (a) and separation between lines (b). Each plot shows data from two observers, replotted from Westheimer and McKee (1977).

2.3 Discussion. Since both versions of the model (A and B) can serve as the basis for hyperacuity-level performance, we adopted the simpler, linear version that omits the transduction stage as a minimalist platform for studying the improvement of hyperacuity with practice. The primary motivation for using oriented filters as basis functions for vernier hyperacuity was the report by Wilson (1986) [see also Klein and Levi (1985) and Watt and Morgan (1985)] that the responses of these filters can explain psychophysical data concerning hyperacuity. In Wilson's model, detection threshold in psychophysical experiments is related to the Euclidean distance between a distributed "filter representation" of the two stimuli. The filter representation is constructed by pooling the responses of filters at all orientations and spatial frequencies concentric with the stimulus, as well as over spatial nearest neighbors. Our model,
Models of Perceptual Learning in Vernier Hyperacuity
703
in contrast, replaces the distributed representation with the output of a single neuron. This does not necessarily make our model more biological, but it makes it easier to model hyperacuity learning in terms of synaptic modification. Wilson's model replicated several psychophysical results concerning the change in hyperacuity thresholds in a variety of hyperacuity tasks when stimulus parameters varied. To see whether these results still hold when the distributed representation is collapsed to a single neuron, we investigated the response of the network to verniers of varying line length and varying gap in the direction of the lines. The results of these simulations appear in Figure 5. In these simulations the statistical distribution of the noise was held fixed (and thus thresholds could be estimated) while the parameters of the stimulus varied. The dependence of the threshold on line length exhibited by the model agrees reasonably well with the data in McKee and Westheimer (1977). Specifically, the threshold decreases steeply with increasing segment length for lengths under 4′, and is essentially unaffected by further increase. The dependence of the threshold on line separation, however, agrees with psychophysical data only qualitatively. The model's threshold increases steeply for separations greater than 4′, while in human subjects the increase is more gradual and is especially noticeable for separations over 7′. Both the results regarding line length and line separation can be attributed to the fact that the width σ_y of the spatial frequency mechanism used by the model was 3.65′ (Wilson and Bergen 1979). As Wilson has pointed out, increasing line length beyond the value of σ_y does not add any significant information, while increasing line separation beyond σ_y forces the human subject to use the less sensitive spatial mechanisms.
Additional motivation for using orientationally selective units as basis functions in our model comes from the electrophysiological studies of Swindale and Cynader (1986), who studied the response to a vernier break by orientation selective cells in cortical area 17 in the cat. The results of that study showed that orientation-selective cells in area 17 can discriminate between different offsets in a vernier stimulus. Specifically, those cells tended to respond to the vernier stimuli in the same manner as they did to an oriented line passing through the midpoints of the two segments composing the vernier. This effect is basically due to the spatially low-pass action of the orientationally selective units, and has been replicated by our model (Fig. 3). Swindale and Cynader used a method proposed by Parker and Hawken (1985) to estimate the “hyperacuity threshold” of single neurons in area 17. This threshold measures the statistical reliability of a change in a neuron’s response to a vernier break. Because the thresholds of some neurons were as low as the behavioral threshold of the cat in vernier discrimination, the authors suggested that the performance of these neurons was the limiting factor in hyperacuity, obviating the need for a fine-grid reconstruction of the stimulus. In response, Parker and Hawken (1987) argued
that the possibility of a fine-grid reconstruction could not be ruled out, because the factors limiting the behavioral hyperacuity threshold may be retinal and not cortical, as suggested by the data on the hyperacuity thresholds of cat retinal ganglion cells due to Shapley and Victor (1986). We note that the HyperBF approach is equally capable of modeling retinally or cortically based hyperacuity mechanisms. While our present model used orientationally selective units similar to cortical simple cells, the HyperBF scheme of Poggio et al. (1992a), which relied on responses of circularly symmetric units, was equally successful in replicating hyperacuity phenomena. The notion of interpolation, inherent to the HyperBF approach, does provide, however, a useful insight into one issue important for both sides in the retina vs. cortex debate, namely, the way of relating behavioral thresholds to those of single neurons. Consider the vernier tuning curve of the vertically oriented unit in our model (the curve marked by triangles in Fig. 3). Despite the fact that this curve is relatively wide and shallow, the responses of three units of this type can support hyperacuity vernier discrimination. Addressing Westheimer's (1981) claim that neurons with a wide orientation response characteristic cannot be involved in hyperacuity tasks, Swindale and Cynader argue that a broadly tuned neuron can still support hyperacuity, as long as its response pattern is statistically reliable. This is equivalent to saying that the slope of the tuning curve, and not its width, should be used as a measure of a neuron's usefulness for hyperacuity. In contrast, our model suggests that neither measure should be considered a sole determinant of the behavioral threshold: the responses of cells slightly rotated with respect to the stimulus actually provide more relevant information for solving the task.
Thus, a network of very reliable vernier detectors may perform worse than a network of less reliable units with a large overlap in their receptive fields. The limiting factor in the vernier task seems therefore to be not the performance of a single unit, but rather the ability of the system to pool responses from different units with overlapping receptive fields, and the manner in which these units cover the range of possible stimulus orientations (see Fig. 7).² To conclude this discussion, we note that Snippe and Koenderink (1992) recently demonstrated analytically, using an ideal observer model, that the resolution of a channel-coded system of circularly symmetric receptive fields is determined both by the reliability of each channel and by the degree of overlap between the channels.

²Swindale and Cynader mention pooling the outputs of several neurons, but they suggest that this pooling occurs between neurons with similar responses, thus increasing the reliability of a single channel, and not pooling the responses of different channels.

3 Modeling Perceptual Learning in Hyperacuity

We now turn to explore the possible ways in which the performance in the vernier task could be made to improve with practice. First, we
[Figure 7 plots: "Response of vertical neuron" and "Response of oriented neuron" vs. vernier offset, −30″ to 30″]
show how different types of learning rules can be formulated within the HyperBF framework. We then focus on two likely candidate mechanisms for unsupervised learning in vernier acuity experiments, and describe additional simulations that help distinguish between them. The results of these simulations suggested a psychophysical experiment, reported in the next section. We also discuss a possible biological basis for one of the two unsupervised learning rules.

Figure 7: To demonstrate the importance of pooling responses of a number of overlapping filters, we conducted two simulations in which the output of the filters was passed through a monotonic nonlinearity before noise was added. In the first simulation, the nonlinearity was compressing at high activity rates, resulting in a shallow vernier tuning curve. In the second simulation, the nonlinearity was accelerating at high activity rates, leading to a steep vernier tuning curve. (a) The average response of two simulated neurons to vernier displacements in a vertically oriented stimulus (lower curve: type I; upper curve: type II). (b) The average response of two simulated neurons with the same nonlinearity as those in (a) but with oriented receptive fields. A network comprised of three neurons of type II (upper curve) performs better in the vernier task than a network comprised of three neurons of type I (lower curve), despite having a shallower vernier tuning curve.
3.1 Classification of Perceptual Learning Models. Perceptual learning models can be categorized according to two basic criteria: how much prior knowledge is assumed, and how dependent the learning is on external supervision. Both parameters can assume a wide range of values, a fact that is sometimes overlooked when models are characterized simply as "supervised" or "unsupervised." Supervised models, in turn, may differ in the nature of the feedback signal they assume. Barto (1989) distinguishes between two types of error signals: those generated by a "teacher" that can point out which parameter of the model should be modified in response to the error it has committed, and those generated by a "critic" that is unaware of the workings of the model. "A critic can generate a payoff based on knowledge solely of what it wants accomplished and not of how the learning system can accomplish it" (Barto 1989, p. 89). Unsupervised models can also differ in their dependence on feedback. Some unsupervised models merely replace the external feedback signal with a self-provided feedback signal, an "internal teacher." Alternatively, an unsupervised model can assume complete independence of feedback, either internal or external. A fundamental tradeoff exists between a learning model's reliance on prior knowledge and on feedback. A model that relies heavily on a teacher can afford to make few prior assumptions, while a model that assumes independence from feedback must rely on prior knowledge to a greater extent. As an example of feedback-independent unsupervised learning, consider the model of hyperacuity, proposed by Barlow and others (Barlow 1979; Crick et al. 1981), that relies on a fine-grid reconstruction of the retinal signal in the cortex. Assume that the module that processes the reconstructed signal has no intrinsic capacity for learning, but operates in such a manner that increasing the accuracy of the fine-grid reconstruction causes an improvement in its performance in the hyperacuity task. Any improvement in the fine-grid reconstruction would then cause a decrease in hyperacuity threshold, but this improvement need not be feedback-dependent. Indeed, the reconstruction may improve after training on a completely different task.

Figure 8: The structure of the network used throughout the learning simulations. The output neuron receives input from 100 neurons, three of which correspond to the properly oriented and positioned linear filters, and the rest to other, random, inputs. Performance is improved by modifying the connections between these neurons and the output neuron.

3.2 Different Learning Modes in a HyperBF Network. In Poggio et al. (1992a), a HyperBF network was synthesized from a "tabula rasa" initial state. Only the shape of the basis functions (radially symmetric multidimensional gaussians) was assumed to be given. The centers of the basis functions were determined by the training examples in an unsupervised fashion, while the coefficients c were updated using
a pseudoinverse technique that assumed an external teacher. Poggio et al. also suggested using self-provided feedback to replace the external teacher. In our model, the structure of the network was assumed to remain fixed throughout training. The network (see Fig. 8) was comprised of 100 units that were connected to an output neuron. Three of the units represented the oriented linear filters described in section 2.1, and the activity of the remaining units was random. The model's performance was improved solely by changing the weight vector c, according to four different update rules:

1. The Widrow-Hoff rule, c^(t+1) = c^(t) + η h [Y(x) − O(x)], where Y and O represent the desired and the actual output for the stimulus x. This rule is supervised by a teacher, and is equivalent to solving equation 2.1 by an incremental pseudoinverse technique (Widrow and Stearns 1985).

2. The Mel-Koch rule, c^(t+1) = c^(t) + α h Y(x) − β c^(t). This learning rule was suggested by Mel and Koch (1990) and was designed to maximize the correlation between the output and the activities of the basis function units, while minimizing the total synaptic weight of the linear stage. This model, unlike the previous one, is supervised by a critic who knows only what the correct answer should be.

3. The self-supervised Widrow-Hoff algorithm. This algorithm is similar to the first one, but feedback is provided only for those inputs in which the vernier offset exceeds the baseline threshold (set at 15″). This model is unsupervised, but is still feedback-dependent. It was designed to simulate the conditions in psychophysical experiments in which subjects receive no feedback at all, but nevertheless possess a clear indication of the correctness of their response for the large values of vernier offset when the stimulus looks trivially easy. Under these conditions, the subjects' thresholds improve with practice, albeit at a slower rate than when explicit feedback is available (Fahle and Edelman 1992).

4. Exposure-dependent learning (EDL), c_i^(t+1) = c_i^(t) + α c_i^(t) if |h_i^(t)| > ε. This is an unsupervised, use-dependent rule, which is independent of feedback. As opposed to the rules listed above, which made no assumptions about the nature of the connections c prior to learning, this rule assumes that the weight vector for the oriented filters is proportional to (+1, 0, −1)^T.
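The four update rules can be written compactly as vector operations. This is a minimal sketch, not the authors' code: the function names, default learning-rate values, and vector shapes are illustrative assumptions.

```python
import numpy as np

def widrow_hoff(c, h, y, out, eta=0.1):
    """Teacher-supervised delta rule: c <- c + eta * h * (Y - O)."""
    return c + eta * h * (y - out)

def mel_koch(c, h, y, alpha=0.1, beta=0.01):
    """Critic-supervised rule: c <- c + alpha * h * Y - beta * c."""
    return c + alpha * h * y - beta * c

def self_supervised_wh(c, h, y, out, offset, baseline=15.0, eta=0.1):
    """Widrow-Hoff, but feedback only for trivially easy stimuli
    (|offset| above the baseline threshold, in arcsec)."""
    if abs(offset) > baseline:
        return c + eta * h * (y - out)
    return c

def edl(c, h, alpha=0.01, eps=0.5):
    """Exposure-dependent learning: grow c_i whenever |h_i| > eps."""
    return c + alpha * c * (np.abs(h) > eps)
```

Note that EDL multiplies the existing weight, so it can only amplify whatever sign structure c already has; this is why it presupposes a weight vector roughly proportional to (+1, 0, −1)^T for the oriented filters.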
3.3 Results. The learning curves for the four rules described in the previous section are shown in Figure 9. All rules except one showed a gradual improvement of performance with practice.³ The shapes of the learning curves should be considered a qualitative characteristic of the models' performance, since they depend on the scalar parameters η and α (we note that considerable variability in the learning rate is also found in human subjects). The models do not achieve perfect performance due to the noise present in the receptor activities ("early noise"). When this noise is increased, performance still improves, but the final performance level is lower (see the lower curve in Fig. 9d).

³The failure of Mel and Koch's rule to converge was not due to simulation details. An analysis of their rule, formulated as a first-order differential equation, showed that it was not guaranteed to converge under the conditions of the simulated experiments, in which widely differing inputs are presented in alternation.

Y. Weiss, S. Edelman, and M. Fahle

[Figure 9 panels: proportion of correct responses plotted against training block (0-20) for each rule.]

Figure 9: Learning curves for different coefficient-learning rules defined in Section 3.2 [(a) Widrow-Hoff; (b) Mel-Koch; (c) self-supervised Widrow-Hoff; (d) EDL]. All rules (except the Mel-Koch rule) show a gradual improvement of performance with practice. The models do not achieve perfect performance due to the noise present in the receptor activities ("early noise"). When this noise is increased, performance still improves, but the final performance level is lower (see the lower curve in condition d).

3.4 Separating the Noise from the Signal: Two Approaches. Of the four learning rules mentioned in the previous section, the two likely candidates for accounting for the improvement with practice found in psychophysical experiments are the two unsupervised rules, because learning has been found to occur in the absence of feedback.

Models of Perceptual Learning in Vernier Hyperacuity

[Figure 10 panels: percentage of correct responses, signal magnitude, and noise magnitude plotted against training block for each rule and condition.]

Figure 10: A comparison of the two unsupervised learning rules under two conditions: one in which the activity of the noisy neurons was determined randomly before each presentation, and one in which it remained constant during learning. The Widrow-Hoff algorithm improves performance under both conditions, but the EDL algorithm does not improve in the second condition, since it tends to amplify the noise rather than the signal.

To help elucidate the difference between the two learning rules, we conducted an additional set of learning simulations. These simulations compared the performance of the two learning rules under two conditions that differed only in the firing patterns of the "noise" units. In both conditions, these firing patterns were determined using a zero-mean gaussian random variable. In the first condition (condition A), the firing rates were determined before each presentation of the stimulus (this condition is identical to the one used in the simulations of the previous section), while in the second condition (condition B), the firing rates were determined prior to training and remained constant throughout training. Note that if each presentation is considered by itself, the statistical properties of the "noise neurons" are identical (independent, identically distributed gaussians).
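The two noise regimes can be sketched as follows. This is a minimal illustration, not the original simulation code; the number of noise units, the seed, and the variable names are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_NOISE = 20       # number of "noise" units (illustrative value)
N_TRIALS = 5

# Condition A: noise-unit firing rates redrawn before every presentation.
noise_A = [rng.normal(0.0, 1.0, N_NOISE) for _ in range(N_TRIALS)]

# Condition B: rates drawn once before training, then frozen.
frozen = rng.normal(0.0, 1.0, N_NOISE)
noise_B = [frozen for _ in range(N_TRIALS)]

# Trial by trial the two conditions are statistically identical
# (zero-mean i.i.d. gaussians); only across trials do they differ.
assert not np.allclose(noise_A[0], noise_A[1])   # A: varies per trial
assert np.allclose(noise_B[0], noise_B[1])       # B: identical every trial
```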
3.4.1 Results. The behavior of the two learning rules under these two conditions is illustrated in Figure 10. The first column shows the percentage of correct responses for the two learning rules. The Widrow-Hoff rule supports learning under both conditions, but the EDL rule does so only in condition A, and actually leads to deterioration of performance with practice in condition B.
The second and the third columns show the evolution of signal and noise magnitudes during training. These were defined as
where indices i = 1, 2, 3 correspond to the oriented filters. Note that the term "signal" for the value S is somewhat misleading, since it also includes the contribution of early noise present in the receptors. The difference between the two learning rules lies in the way they distinguish signal from noise. The Widrow-Hoff algorithm converges to a vector c such that c · h is closest in the mean-square sense to the desired output. Thus, presynaptic activity that is completely uncorrelated with the desired output will result in a zero-weight synapse, in effect labeling the corresponding presynaptic unit as noise. Note that according to this definition, the activity of the vertically oriented filter is also labeled as noise, and indeed the Widrow-Hoff algorithm results in a zero-weight synapse between the vertically oriented filter and the output unit. A feedback-independent learning rule such as EDL cannot rely on the correlation between the desired output and the presynaptic activity to distinguish noise from signal. The heuristic used by this rule is to label as signal those inputs that are consistently active at a significant rate when the stimulus is presented. This is achieved by increasing by a small amount, at each presentation, the contribution of every unit whose activity is greater than some threshold ε. Thus, when the activities of the random units are recalculated prior to each stimulus presentation, the increase in their contribution to the output unit is negligible compared to the increase in the contribution of the oriented filters (see Fig. 10), because only the oriented filters are consistently active. In condition B, in contrast, the number of noise units whose activity is greater than ε is the same as in condition A, but the increase in the contribution of the noise units is significant, due to the consistent activity of the same noise units.
In the simulations described above, the number of random units with activity greater than E was greater than the number of oriented filters, so that in condition B noise was boosted more than the signal, resulting in decreased performance.
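The contrast between the two rules can be made concrete in a small simulation. This is a minimal sketch under assumed parameter values (η = 0.05, ε = 0.5, one signal unit and one frozen noise unit, as in condition B); the function names and constants are ours, not taken from the original simulations.

```python
import numpy as np

rng = np.random.default_rng(1)
eta, eps = 0.05, 0.5              # learning rate and EDL threshold (illustrative)

def widrow_hoff(c, x, target):
    # Delta rule: nudge c so that c @ x approaches the desired output.
    return c + eta * (target - c @ x) * x

def edl(c, x):
    # Exposure-dependent learning: amplify, in proportion to its current
    # strength, every synapse whose presynaptic unit fires above threshold.
    return c * (1.0 + eta * (x > eps))

c_wh = np.array([0.5, 0.5])       # [signal weight, noise weight]
c_edl = np.array([0.5, 0.5])
for _ in range(200):
    sig = rng.choice([-1.0, 1.0]) # signal unit tracks the desired output
    x = np.array([sig, 0.8])      # condition B: noise unit frozen at 0.8
    c_wh = widrow_hoff(c_wh, x, target=sig)
    c_edl = edl(c_edl, x)

assert abs(c_wh[1]) < 0.1         # Widrow-Hoff labels the frozen unit "noise"
assert c_edl[1] > c_edl[0]        # EDL labels it "signal" and amplifies it
```

The frozen noise unit fires above ε on every trial, while the informative unit does so only on half the trials, so EDL boosts the noise synapse fastest, which is the mechanism behind the deterioration in condition B.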
3.4.2 A Possible Biological Basis of the EDL Rule. The EDL update rule requires a modulatory signal to mark the time frame within which presynaptic activity should be measured, and a mechanism for updating synapses based on presynaptic activity. A learning rule that satisfies
these requirements has been studied extensively both at the behavioral and at the cellular level by Hawkins et al. (1983) in Aplysia. The gill-withdrawal reflex of Aplysia following stimulation of a particular site on the siphon was found to be enhanced by a shock delivered to the tail in parallel with the stimulation of the siphon. This enhancement was specific to the site that was stimulated at the time of the shock delivery to the tail, presumably because it depended on simultaneity of the shock and the siphon stimulation. A correlate of this phenomenon at the cellular level was found in a study of the change of the excitatory postsynaptic potentials (EPSPs) elicited in a common postsynaptic neuron by siphon sensory neurons. During training, two sensory neurons were stimulated intracellularly. Stimulation of one of them immediately preceded the shock to the tail, while stimulation of the other sensory neuron followed the shock by 2.5 min. It was found that the change in the amplitude of the EPSP from the paired neuron was significantly greater than that in the unpaired neuron. Further experiments suggested that "activity-dependent amplification of facilitation is presynaptic in origin and involves a differential increase in spike duration and Ca2+ influx in paired versus unpaired neurons" (Hawkins et al. 1983). The requirements of the model of Hawkins et al. are (1) facilitatory neurons, which are excited by motivationally significant stimuli and which may project very diffusely (in principle, a single such neuron that produced facilitation in all of the sensory neurons would be sufficient to explain the results), and (2) differential activity in the neurons that receive facilitatory input (Hawkins et al. 1983). In our simulations we assumed that all the units received the modulatory signal, that is, the network had no a priori knowledge as to which units' activities were more likely to be significant.
The main difference between our mechanism and the one suggested by Hawkins et al. is that we assume that when synaptic amplification occurs, it is proportional to the previous synaptic strength. Without this assumption, the modification of the synapses distorts whatever structure the network connections had prior to learning. For example, the connections of the vertically oriented unit, which is irrelevant to the task, could increase if this assumption were dropped. We note that this assumption adds a Hebbian element to the learning rule. Consider two units with identical activity, one of which has a strong connection to the output unit and consistently takes part in the activation of the output unit, and the other is weakly connected to the output. The synaptic weight of the strongly connected unit, whose activity is correlated with the output, would increase more significantly than that of the weakly connected unit. We do not assume, however, that there is any causal relationship between the correlation of pre- and postsynaptic activities, and the synaptic modification. Hence, our rule may be classified as a “noninteractive Hebbian rule” (Brown et al. 1990).
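The role of the proportionality assumption can be seen in a two-line comparison (the numbers below are illustrative, not from the simulations):

```python
# Two equally active units; one strongly, one weakly connected beforehand.
strong, weak = 1.0, 0.1
eta = 0.05

# Proportional amplification (our assumption) preserves the prior ratio...
s_prop, w_prop = strong * (1 + eta), weak * (1 + eta)
# ...while a fixed additive increment distorts the prior structure.
s_add, w_add = strong + eta, weak + eta

assert abs(s_prop / w_prop - strong / weak) < 1e-9   # 10:1 ratio kept
assert s_add / w_add < strong / weak                 # ratio shrinks toward 1
```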
4 Psychophysical Experiments
In the previous section, we described the difference between the two unsupervised learning rules in terms of their approach to distinguishing between signal and noise. If a group of units is consistently active during stimulus presentation, but their activities are uncorrelated with the desired output, then they would be labeled as noise by the Widrow-Hoff algorithm and as signal by the EDL rule. To elucidate the possible role of an EDL-like rule in the improvement of performance in vernier hyperacuity, we conducted psychophysical experiments using the cross-shaped stimuli shown in Figure 11. Two orthogonal verniers appeared simultaneously in each trial, but the subjects were required to judge the sense of the misalignment of only one of the two verniers (the orientation of the relevant vernier was the same throughout each experimental block). In this situation, the units responsive to the irrelevant part of the stimulus are consistently activated during stimulus presentation, but their activity is uncorrelated with the desired output. Hence, if the learning mechanism contains a significant use-dependent component (of the kind that can be provided by the EDL rule), such an experiment is expected to demonstrate similar improvement with practice in the two orientations.

4.1 Method. Subjects performed three tasks that involved cross-shaped stimuli as in Figure 11. Stimuli were generated and displayed on a Silicon Graphics 4D35/TG workstation. Viewing distance was such that one pixel corresponded to about 8". Stimuli were presented for 100 ms and were separated by a 1 sec interval (during which a frame was displayed to assist fixation). Subjects indicated their response by pressing one of two buttons on the computer mouse. Auditory feedback was given for incorrect responses.
The stimuli in the first two tasks, HORIZONTAL CROSS and VERTICAL CROSS, were the same, except that in one (VERTICAL CROSS) subjects were required to determine the direction of misalignment of the vertical part of the stimulus (and received appropriate feedback), while in the other (HORIZONTAL CROSS) they were required to judge the misalignment of the horizontal part of the stimulus (again with appropriate feedback). In the third task, DIAGONAL CROSS, the stimuli were oriented diagonally. Each block consisted of a fixed number of presentations of all offsets of the relevant stimulus (in the range from -20 to +20 pixels) in a random order of presentation. The irrelevant part of the stimulus (e.g., the vertical vernier in a HORIZONTAL CROSS block) was presented with a random offset and was thus uncorrelated with the error signal. The experiments consisted of three stages:

• Measurement of baseline performance in all three tasks.

• Training in either the vertical or the horizontal task.

• Testing in the two tasks for which there was no training.
The diagonal task was added only after the first two subjects had completed the experiment, and they were called back for an additional block of testing. Before they were tested on the diagonal cross, we assessed their performance on the horizontal cross to determine whether they retained their learning despite the elapsed time (10 days for observer YK and 45 days for observer FL).

4.2 Results. Results are shown in Table 1. Because of the extensive coverage of the range of offsets (all offsets smaller than 20 pixels were presented in each block), we were able to plot the observers' psychometric curves in each block. For some observers (see Fig. 12), these curves assumed the usual sigmoid shape only after training. For this reason, we measured the observers' performance by the percentage of correct
[Figure 11 panels: VERTICAL CROSS, HORIZONTAL CROSS, DIAGONAL CROSS.]

Figure 11: The stimuli used in the experiments were similar to those seen above. In each block observers were shown vertical, horizontal, or diagonal crosses with randomly varied offsets.
[Figure 12 panel: percentage of correct responses (40-100) plotted against offset (0-20 pixels), before and after training.]
Figure 12: Representative psychometric curves (observer FL) before and after learning in the horizontal task. The response curves start to resemble a sigmoid only after training. responses in a range of offsets kept constant throughout training, rather than by a threshold estimated via probit analysis. As in previous experiments on perceptual learning, individual differences in the learning rates could be observed. When learning did occur for the attended part of the stimulus, it was accompanied by a significant improvement for the part that was present throughout training but was uncorrelated with the feedback. No such concomitant improvement was found after training in the diagonal test stimulus, which was not present in training (but note that only the results of observer RM in the diagonal tasks serve as a true control, as the others were either not tested for baseline or did not learn at all). Observer FL apparently did not retain his performance after a prolonged time break, while observer YK showed a much smaller decrease in performance. Observer YA explained that when tested on the vertical task (following training on the horizontal task) she tried to find a hidden cue in the horizontal verniers displayed simultaneously with the vertical verniers. Since these verniers were uncorrelated with the correct response, this explains the deterioration of her performance on the vertical task.
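The percent-correct measure over a fixed offset range can be illustrated with a hypothetical observer whose responses follow a cumulative-gaussian (probit-style) curve of the kind shown in Figure 12; the threshold value and curve shape below are our assumptions, not fitted data:

```python
import math

def psychometric(offset, threshold=5.0):
    # Probability of responding "right" for a signed offset (in pixels),
    # modeled as a cumulative gaussian (illustrative parameters).
    return 0.5 * (1.0 + math.erf(offset / (threshold * math.sqrt(2.0))))

# Percentage of correct responses over a fixed range of offsets
# (|offset| <= 20 pixels), the measure used in place of a probit threshold.
offsets = [o for o in range(-20, 21) if o != 0]
p_correct = sum(
    psychometric(o) if o > 0 else 1.0 - psychometric(o) for o in offsets
) / len(offsets)
assert 0.5 < p_correct < 1.0   # above chance, below ceiling
```

Unlike a probit threshold, this measure stays well defined even when the response curve is not yet sigmoidal, which is why it was preferred here.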
Table 1: The percentage of correct responses in each task before and after training.*

                        Horizontal                 Vertical                Diagonal
Subject            Before      After          Before     After        Before    After

Vertical training
FL                 61 ± 1.8    83 ± 0.64 (67) 59 ± 1.2   75 ± 0.85    -         58 ± 1.5
YK                 78 ± 1.33   91 ± 0.5 (88)  67 ± 1.7   77 ± 1.1     -         75 ± 1.2

Horizontal training
YA                 85 ± 1.0    87 ± 0.7       76 ± 1.4   71 ± 1.3     75 ± 1.5  76 ± 1.1
RM                 70 ± 1.6    78 ± 1.1       75 ± 1.4   84 ± 0.8     86 ± 0.9  83 ± 0.9
AS                 81 ± 1.2    83 ± 0.9       73 ± 1.5   74 ± 1.2     70 ± 1.6  60 ± 1.5

*An improvement in the horizontal task is accompanied by an improvement in the vertical task, and a lack of improvement in the horizontal task is accompanied by a corresponding lack of learning in the vertical task. The numbers in parentheses are the performance of subjects who were called back for additional testing after a significant time break (45 days for observer FL and 10 days for observer YK).
4.3 Discussion. These results are consistent with a use-dependent learning rule such as EDL. Note that this rule still predicts that learning will be stimulus-specific and will not transfer to new tasks, but it distinguishes between two notions of novelty: 1. A task is new if an appropriate response function cannot be interpolated from that of familiar examples;
2. A task is new if the units used to compute the response function were not significantly active during familiarization or training. In some cases (Fiorentini and Berardi 1981; Fahle and Edelman 1992), both definitions of novelty apply to the stimuli used to assess transfer of training, and the lack of transfer to these stimuli can be accounted for by models that involve either use-dependent rules or feedback-dependent ones, or both. A further indication that use-dependent synaptic modification may be involved in perceptual learning has been reported recently by Karni and Sagi (1991). In their experiments, subjects performed letter discrimination followed by texture discrimination in the same complex stimulus. Their results show significant learning in the texture task, even though feedback was given only for the letter discrimination.
5 Conclusion
The central assumption of the HyperBF approach to the modeling of perceptual function is that the human ability to solve a variety of different perceptual tasks is based on the acquisition of specific input-output examples and on subsequent optimization of the use of the stored examples with practice. Rationale for this twofold assumption has been provided by the results of simulated psychophysical experiments (Poggio et al. 1992a) that demonstrated that a HyperBF model can learn to solve spatial discrimination tasks with hyperacuity precision, starting from a "tabula rasa" state and continuously improving its performance with repeated exposure to the stimuli. In the present paper, we concentrated on two computational details of the HyperBF model of vernier acuity. First, we investigated the possibility that oriented spatial filters known to exist in the primate visual system (namely, units similar to the simple cells of Hubel and Wiesel 1962) can serve as the basis functions in a HyperBF network. Second, we explored the different mechanisms available within the HyperBF framework for incremental learning at the level of the linear combination of basis function activities. Our findings indicate that a simple feedback-independent rule for synaptic modification, which we called EDL, for exposure-dependent learning, may be involved in the improvement of the performance of human subjects with practice. Both our simulations and our psychophysical data suggest that a significant component of learning in hyperacuity may be based on stimulus-driven, feedback-independent amplification of unit responses, rather than on precise feedback-guided fine tuning within a perceptual module. We remark that the perceptual module whose prior availability is assumed by the EDL rule can either be hard-wired from birth, or synthesized in a task-driven fashion, as suggested in Poggio et al. (1992a).
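The HyperBF readout discussed above (a learned linear combination of basis-function activities, in the spirit of Poggio and Girosi 1990) can be sketched as follows. The gaussian basis, the toy task, and all parameter values are our assumptions, not the model actually used in the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def hyperbf(x, centers, c, sigma=1.0):
    # Output = linear combination of gaussian basis functions centered
    # on stored examples; c holds the learned coefficients.
    g = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2 * sigma ** 2))
    return c @ g

# Toy task: report the sign of a 1D offset, with two stored examples
# serving as the centers.
centers = np.array([[-1.0], [1.0]])
c = np.zeros(2)
eta = 0.1
for _ in range(500):
    x = rng.uniform(-2, 2, size=1)
    target = np.sign(x[0])
    g = np.exp(-np.sum((centers - x) ** 2, axis=1) / 2)
    c += eta * (target - c @ g) * g   # Widrow-Hoff on the coefficients

assert hyperbf(np.array([1.5]), centers, c) > 0
assert hyperbf(np.array([-1.5]), centers, c) < 0
```

In the model studied in this paper, the oriented-filter responses play the role of the basis activities g, and only the coefficient layer c is modified during learning.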
If one accepts the possibility that the visual system is capable of modifying certain aspects of its functional architecture on the fly, the stimulus-driven learning can be given an alternative account in terms of acquisition of new HyperBF centers (T. Poggio, personal communication). It is not clear to us at present whether or not this possibility can be distinguished psychophysically from our account in terms of synaptic modification using existing centers and the EDL rule. The presence of the initial fast stimulus-specific component in the learning curve in hyperacuity tasks (Poggio et al. 1992b) is consistent with the module synthesis view. The record of the last two and a half millennia indicates, however, that the Platonic notion of innate ideas (corresponding, in the present case, to innate perceptual mechanisms tuned by experience) is sufficiently resilient to cope with mere circumstantial evidence to the contrary. It remains to be seen whether a more direct approach, possibly combining physiology with psychophysics and computational modeling, will be more successful in elucidating the nature of perceptual learning.
Acknowledgments We thank T. Poggio for stimulating discussions, and two anonymous reviewers for useful and detailed suggestions. Y. W. was supported by the Karen Kupcinet Fund, and by a grant to S. E. from the Basic Research Foundation, administered by the Israel Academy of Sciences and Humanities.
References

Barlow, H. B. 1979. Reconstructing the visual image in space and time. Nature (London) 279, 189-190.
Barto, A. 1989. From chemotaxis to cooperativity: Abstract exercises in neuronal learning strategies. In The Computing Neuron, R. Durbin, C. Miall, and G. Mitchison, eds., pp. 73-98. Addison-Wesley, New York.
Brown, T. H., Kairiss, E. W., and Keenan, C. L. 1990. Hebbian synapses: Biophysical mechanisms and algorithms. Annu. Rev. Neurosci. 13, 475-511.
Crick, F. H. C., Marr, D. C., and Poggio, T. 1981. An information-processing approach to understanding the visual cortex. In The Organization of the Cerebral Cortex, F. Schmitt, ed., pp. 505-533. MIT Press, Cambridge, MA.
Fahle, M. W., and Edelman, S. 1993. Long-term learning in Vernier acuity: Effects of stimulus orientation, range, and of feedback. Vision Res. 33, 397-412.
Fendick, M., and Westheimer, G. 1983. Effects of practice and the separation of test targets on foveal and perifoveal hyperacuity. Vision Res. 23, 145-150.
Fiorentini, A., and Berardi, N. 1981. Perceptual learning specific for orientation and spatial frequency. Nature (London) 287, 453-454.
Hawkins, R. D., Abrams, T. W., Carew, T. J., and Kandel, E. R. 1983. A cellular mechanism of classical conditioning in Aplysia: Activity-dependent amplification of presynaptic facilitation. Science 219, 400-404.
Hubel, D. H., and Wiesel, T. N. 1962. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160, 106-154.
Karni, A., and Sagi, D. 1991. Where practice makes perfect in texture discrimination. Proc. Natl. Acad. Sci. U.S.A. 88, 4966-4970.
Klein, S. A., and Levi, D. M. 1985. Hyperacuity thresholds of 1 sec: Theoretical predictions and empirical validation. J. Opt. Soc. Am. A 2, 1170-1190.
McKee, S. P., and Westheimer, G. 1978. Improvement in vernier acuity with practice. Percept. Psychophys. 24, 258-262.
Mel, B. W., and Koch, C. 1990. Sigma-Pi learning: On radial basis functions and cortical associative learning. In Neural Information Processing Systems, D. Touretzky, ed., Vol. 2, pp. 474-481. Morgan Kaufmann, San Mateo, CA.
Parker, A. J., and Hawken, M. J. 1985. Capabilities of monkey cortical cells in spatial resolution tasks. J. Opt. Soc. Am. A 2, 1101-1114.
Parker, A. J., and Hawken, M. J. 1987. Hyperacuity and the visual cortex. Nature (London) 326, 105-106.
Poggio, T., Edelman, S., and Fahle, M. 1992a. Learning of visual modules from examples: A framework for understanding adaptive visual performance. Comput. Vision, Graphics, Image Process.: Image Understanding 56, 22-30.
Poggio, T., Fahle, M., and Edelman, S. 1992b. Fast perceptual learning in visual hyperacuity. Science 256, 1018-1021.
Poggio, T., and Girosi, F. 1990. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247, 978-982.
Shapley, R., and Victor, J. 1986. Hyperacuity in cat retinal ganglion cells. Science 231, 999-1002.
Snippe, H. P., and Koenderink, J. J. 1992. Discrimination thresholds for channel-coded systems. Biol. Cybern. 66, 543-551.
Swindale, N. V., and Cynader, M. S. 1986. Vernier acuity of neurones in cat visual cortex. Nature (London) 319, 591-593.
Walk, R. D. 1978. Perceptual learning. In Handbook of Perception, E. C. Carterette and M. P. Friedman, eds., Vol. IX, pp. 257-298. Academic Press, New York.
Watt, R. J., and Morgan, M. J. 1985. A theory of the primitive spatial code in human vision. Vision Res. 25, 1661-1674.
Westheimer, G. 1981. Visual hyperacuity. Prog. Sensory Physiol. 1, 1-37.
Westheimer, G., and McKee, S. P. 1975. Visual acuity in the presence of retinal image motion. J. Opt. Soc. Am. 65, 847-850.
Westheimer, G., and McKee, S. P. 1977. Spatial configurations for visual hyperacuity. Vision Res. 17, 941-947.
Widrow, B., and Stearns, S. D. 1985. Adaptive Signal Processing. Prentice Hall, Englewood Cliffs, NJ.
Wilson, H. R. 1986. Responses of spatial mechanisms can explain hyperacuity. Vision Res. 26, 453-469.
Wilson, H. R., and Bergen, J. R. 1979. A four mechanism model for threshold spatial vision. Vision Res. 19, 19-32.
Wilson, H. R., and Gelb, D. J. 1984. Modified line-element theory for spatial frequency and width discrimination. J. Opt. Soc. Am. A 1, 124-131.
Received 24 June 1992; accepted 1 February 1993.
Communicated by James Anderson
Learning to Generalize from Single Examples in the Dynamic Link Architecture

Wolfgang Konen
Christoph von der Malsburg
Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany

A large attraction of neural systems lies in their promise of replacing programming by learning. A problem with many current neural models is that with realistically large input patterns learning time explodes. This is a problem inherent in a notion of learning that is based almost entirely on statistical estimation. We propose here a different learning style, where significant relations in the input pattern are recognized and expressed by the unsupervised self-organization of dynamic links. The power of this mechanism is due to the very general a priori principle of conservation of topological structure. We demonstrate that style with a system that learns to classify mirror-symmetric pixel patterns from single examples.
1 Introduction

Learning is the ability of animals and humans to absorb structure from one scene and apply it to others. The literal storage of whole sensory input fields is of little value, since scenes never recur in all detail within our lifetime. Essential for learning is therefore the ability to extract significant patterns from an input field containing mostly patterns with accidental feature constellations, and to apply those significant patterns to the interpretation of later scenes. How can significant patterns be identified? Theories of learning based on layered neural networks [e.g., backpropagation of errors (Rosenblatt 1962; Rumelhart et al. 1986) or the Boltzmann Machine (Ackley et al. 1985)] are based on the notion that significant patterns are, above all, recurring patterns. Such systems have an input layer, an output layer, and hidden units. During a learning phase, many examples are presented to input layer and output layer, and the system is enabled by some plasticity mechanism to pick up and represent patterns that recur with statistical significance in the input training set. This method of identifying significant patterns may be the obvious one, going back to the original definition of significance based on recurrence, but with realistic inputs taken from natural environments it is far too costly, in terms of the number of inputs required to discriminate significant patterns from accidental

Neural Computation 5, 719-735 (1993) © 1993 Massachusetts Institute of Technology
Wolfgang Konen and Christoph von der Malsburg
Figure 1: Symmetrical pixel patterns. Input patterns are arrays of N x N pixels, here N = 8. Pixel a has gray-level feature value F_a ∈ {1, ..., F_max}. In most of our simulations, F_max = 10. In each input image, pixel values are random, but equal for points symmetrical with respect to one of three axes: (A) horizontal, (B) vertical, (C) diagonal. The system has to solve the task of assigning input patterns to classes according to these symmetries, and to learn this performance from examples.

patterns. The reason for this difficulty lies in the combinatorial explosion in the number of subsets that can be selected from large input fields (there are, for instance, 10^432 possible subsets of size 1000 in a set of 10^4). Among those subsets there are only relatively few of significant interest (in vision, for example, the criterion of spatial continuity alone singles out relatively few subsets). There obviously are potent methods, presumably based on a priori knowledge built into the system, to extract significant patterns from a scene. It is generally recognized that methods based purely on scene statistics must be complemented (if not supplanted) by more powerful ones based on a priori structure. One widespread piece of advice is to use input representations that are already adapted to the problem at hand. Down that alley there is, of course, the pitfall of hand-wiring instead of learning the essential structure. The real challenge is to find simple and general architectures that can handle large classes of problems and that can learn with a minimum of scene statistics. The particular problem we are considering here was originally proposed by Sejnowski et al. (1986). It consists in learning to classify mirror-symmetrical pixel patterns (see Fig. 1). The authors solved the problem with the help of scene statistics.
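The stimulus construction of Figure 1 can be sketched as follows; this is a minimal illustration, and the sampling details are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def symmetric_pattern(n=8, f_max=10, axis="horizontal"):
    # Random gray levels in {1, ..., f_max}, made equal for pixel pairs
    # mirrored about the chosen axis (cf. Fig. 1).
    p = rng.integers(1, f_max + 1, size=(n, n))
    if axis == "horizontal":      # symmetry about the horizontal midline
        p = np.concatenate([p[: n // 2], p[: n // 2][::-1]])
    elif axis == "vertical":      # symmetry about the vertical midline
        p = np.concatenate([p[:, : n // 2], p[:, : n // 2][:, ::-1]], axis=1)
    else:                         # "diagonal": symmetry about the main diagonal
        p = np.triu(p) + np.triu(p, 1).T
    return p

h = symmetric_pattern(axis="horizontal")
assert (h == h[::-1, :]).all()    # rows mirror about the horizontal axis
v = symmetric_pattern(axis="vertical")
assert (v == v[:, ::-1]).all()    # columns mirror about the vertical axis
d = symmetric_pattern(axis="diagonal")
assert (d == d.T).all()           # array equals its transpose
```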
Their system, consisting of a layer of 12 hidden units and 3 output units corresponding to the 3 symmetry classes, learned as a Boltzmann Machine, which is a variant of supervised learning. With input arrays of 10 x 10 pixels the system
Learning to Generalize
needed about 40,000 training examples in order to reach a success level of 85%. The system (Sejnowski et al. 1986) demonstrates the strength and the weakness of statistical pattern classification. The strength is full generality with respect to possible patterns. This is best demonstrated with the thought experiment of applying a permutation to the pixels in the input field, the same permutation applied to all patterns presented. The system would then have the same learning performance, in spite of the complete destruction of topological structure. The weakness is an explosion in the number of examples required when scaling to larger input arrays. This weakness the system shares with a wide class of learning algorithms, all of which are based on the statistical detection of classes as clusters in input space and their subsequent representation by single prototypes. Prominent examples are the k-nearest neighbor (kNN) algorithm (Fix and Hodges 1951; Cover and Hart 1967), the RCE algorithm (Reilly et al. 1982), which is a neural version of kNN, and adaptive vector quantization (LVQ, LVQ2) (Kohonen et al. 1988). None of these algorithms can easily deal with the symmetry classification problem. The reason is that already with modest problem size there are astronomically many patterns in a symmetry class (10^32 for the 8 x 8 pixels of 10 features each in Fig. 1), and that these do not form clusters in input space and thus cannot be detected in a small training set. It is this that leads to the exploding demand for learning time and number of prototypes. Our treatment of the problem is based on the Dynamic Link Architecture (DLA) (von der Malsburg 1981). The strength of the DLA essential in the present context is its ability to detect pattern correspondences. An application of this ability to the problem of invariant pattern recognition has been reported previously (Bienenstock and von der Malsburg 1987; von der Malsburg 1988; Lades et al. 1993).
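The orders of magnitude invoked here can be checked directly:

```python
import math

# Subsets of size 1000 drawn from an input field of 10^4 elements:
# the exact count exceeds even the ~10^432 figure mentioned in the text.
assert math.comb(10_000, 1_000) > 10**432

# One symmetry class in Fig. 1: an 8 x 8 array with mirror symmetry has
# 8 * 8 / 2 = 32 free pixels, each taking one of 10 feature values.
assert 10 ** (8 * 8 // 2) == 10**32
```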
Here we demonstrate that with it symmetry classes can be recorded from single examples for later recognition. Our treatment is based on the a priori restriction that significant relations within the input pattern are those which preserve topological structure. It is in this sense less general than the Boltzmann Machine, not being able to deal with the permutation symmetries mentioned above. On the other hand its extreme speed of adaptation to new symmetries makes it more potent than the Boltzmann Machine. Most of what is achieved in other neural systems with the help of statistical learning is performed here by the self-organization of an explicit representation of the symmetry in the input pattern.

2 Symmetry Detection by Dynamic Link Matching-Qualitative Model
Dynamic link matching is capable of finding and representing topological, feature-preserving mappings between parts of the input plane.
722
Wolfgang Konen and Christoph von der Malsburg
Such mappings are systems of pair-wise links that are neighborhood-preserving and that connect pairs of points with the same local properties in the input pattern. In this section we describe the network and its function qualitatively and establish its relationships to other, previously published models (von der Malsburg 1988; Bienenstock and von der Malsburg 1987; Buhmann et al. 1989; Lades et al. 1993; von der Malsburg and Buhmann 1992), and to the circuitry of cortical structures. In the next section we will describe an explicit quantitative, though somewhat simplified, model.

The network resembles primary visual cortex in representing visual images in a columnar fashion: Each resolution unit ("pixel") of the sensory surface is subserved by a collection ("column") of neurons, each neuron reacting to a different local feature. (In our concrete model, local features will simply be gray values. In a more realistic version, features would refer to texture, color, and the like.) There are intracolumnar connections, whose function will be explained below, and intercolumnar connections. The latter are what we will refer to as "dynamic links"; they are of rather large range in visual space and are restricted to pairs of neurons with the same feature type. (In our explicit model the connections will run between cells responding to the same gray value in the image.) When a pattern is presented as visual input, those neurons in a column are selected that code for a feature that is present in the corresponding pixel. We refer to the selected cells as "preactivated" neurons. The set of all preactivated neurons represents the input image. During the presentation of an image, the preactivated cells are actually not allowed to fire all at the same time. Rather, activity in the network takes the form of a sequence of "blob activations." During a blob activation, only those preactivated neurons are permitted to fire that lie in a group of neighboring columns.
A blob activation corresponds to the "flash of the searchlight of focal attention" discussed, for instance, by Crick (1984). In the absence of any other control of attention, blob activations are created spontaneously in random positions in a rapid sequence of "cycles." When a blob is active, its active cells send out signals that excite preactivated neurons of the same feature type in other locations. Thus, within the total network those preactivated neurons are excited whose type is represented in the active blob. Most of these cells form a diffuse spray over the image domain. If there is a symmetry in the image, however, there will be a location where all the feature types in the active blob are assembled locally again. With appropriate dynamics, those neurons are activated as well, forming a "shadow blob." The network thus has discovered the significant relationship between two symmetrical regions in the image, and with the help of rapid synaptic plasticity in the intercolumnar connections ("dynamic links") it is possible to record it, simply strengthening the synaptic connections between all pairs of neurons lying one in each blob. During a sequence of many blob pairs, a full consistent system of point-to-point connections will get established, forming a topological mapping between the symmetric parts of the image. This sequence of events constitutes the dynamic link mapping mechanism. It is very robust. Occasional erroneous blob pairs are of little consequence, whereas all correct blob pairs form a cooperative system of mutual reinforcement. Once the covering of the image with blobs is fairly complete, the plexus of reinforced connections stabilizes signal correlations between symmetric points and, as our simulations show, false blob pairs no longer occur. For each new image (or for each new fixation of an image, for that matter), a new mapping of dynamic links has to be built up.

A slow, and simpler, version of the dynamic link mapping mechanism was first described in Willshaw and von der Malsburg (1976) to account for the ontogenetic establishment of retinotopic mappings from retina to tectum. A dynamic link mapping system using feature labels has later been proposed as a solution to the problem of invariant object recognition (von der Malsburg 1988; Bienenstock and von der Malsburg 1987; Buhmann et al. 1989; Lades et al. 1993). As a mapping system, the present model goes beyond previous work in needing dramatically fewer activation cycles. The columnar connectivity pattern described here was introduced as part of a proposed solution to the figure-ground segmentation problem (Schneider 1986; von der Malsburg and Buhmann 1992).

In the explicit model described below some network details are just necessary to realize the qualitative behavior described above. Others, however, we introduced to simplify the dynamics of our system. Prominent among these is the introduction of an "activator cell" (or X-cell) and a "collector cell" (or Y-cell) for each column (see Fig. 2A). The activator cells spontaneously create the active blob and activate all sensorily preactivated neurons in their column.
The collector cells sum up all activity that arrives in the preactivated neurons of their column and that comes from the active blob, and they interact to form the shadow blob. Also, active collector cells gate the preactivated neurons in their columns into full activity. The presence of activator cells and collector cells ensures that all preactivated neurons in a column make their firing decision together. Global inhibition between all activator cells and between all collector cells ensures that there is exactly one active blob and exactly one shadow blob at any one time. An activator cell is kept by a compensating inhibitory connection from exciting the collector cell of its own column via its feature cells. In our explicitly simulated network described below we make the simplifying assumption that during the presentation of an image exactly one of the feature cells in a column is active (corresponding to one of a number of possible gray values). As a consequence, at most one intercolumnar connection is active between two columns at any one time (exactly when the two columns are preactivated with the same gray value). This justifies our introduction of "compound connections" from the activator cells to the collector cells, treating all columnar quality cells and their connections implicitly (see Fig. 2B).

Figure 2: Architecture of the dynamic link network. (A) The complete architecture. The columns in two positions, a and b, are shown. Feature cells are preactivated by the pattern presented. Columns are connected with each other by feature-preserving links. These links are rapidly modifiable ("dynamic links"). Both the activator cells (layer X) and the collector cells (layer Y) have short-range excitatory and long-range inhibitory connections (not shown) and each have the tendency to form a local blob. Coupling from an X-cell a to a Y-cell b is via the preactivated cells in column a, intercolumnar links, and the preactivated cells in column b. (B) In our case, where only one feature is active per column, a functionally equivalent description uses the effective connections J_ba T_ba, where T_ba encodes the feature similarity between image positions a and b (cf. equation 3.1), and J_ba is the rapidly modifiable strength of the dynamic link.

3 Symmetry Detection by Dynamic Link Matching-Explicit Model
After these preparatory heuristic discussions we are ready to introduce the explicit dynamic link mapping network that we have simulated. It has the following parts (cf. Fig. 2B). Our image domain is formed by a grid of 8 × 8 pixels. Positions in the image domain are designated by letters a, b, . . . . An input image is described by a distribution of features F_a over the pixel positions a, where F_a ∈ {1, . . . , F_max} (see Fig. 1). The image domain is covered by two layers of cells, the X-layer and the Y-layer. The
connection from cell a in the X-layer to cell b in the Y-layer is controlled by the dynamic link variable J_ba, which is subject to the dynamics described below. The constraint of feature specificity is formulated with the help of the similarity constraint matrix

    T_ba = 1  if F_a = F_b and b ≠ a,
    T_ba = 0  otherwise.                                           (3.1)
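The similarity constraint matrix of equation 3.1 is straightforward to build; the following NumPy sketch (the function name and array layout are our own) constructs T from a flat array of feature labels:

```python
import numpy as np

def similarity_matrix(F):
    """Similarity constraint matrix of equation 3.1:
    T[b, a] = 1 if F_a == F_b and b != a, else 0."""
    F = np.asarray(F)
    T = (F[None, :] == F[:, None]).astype(float)  # 1 wherever features match
    np.fill_diagonal(T, 0.0)                      # exclude the case b == a
    return T

# Four pixels with gray values 1, 2, 1, 3: only positions 0 and 2 match.
T = similarity_matrix([1, 2, 1, 3])
```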
The total connection from cell a in the X-layer to cell b in the Y-layer is described by the "effective coupling" J_ba T_ba. The activities of cells are designated x_a or y_a. Both layers have homogeneous internal connections of strength

    G_aa' − β                                                      (3.2)

between cells a and a'.
Here, G_aa' is a short-range excitatory connection kernel, and β is the strength of a long-range (here: global) inhibitory connection. For both X and Y we assume wrap-around boundary conditions. The dynamics of the X-layer is governed by the differential equations (3.3). Here, S(x) is a sigmoidal nonlinearity that saturates at S(x) = 0 for low x and at S(x) = 1 for high x, whereas ρ is a constant excitation. The dynamics of the Y-layer is governed by the differential equations (3.4).
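The displayed equations 3.3 and 3.4 are not legible in this copy, but their ingredients are all named in the text: a sigmoid S, a gaussian excitatory kernel G with wrap-around, global inhibition β, a constant excitation ρ, and small noisy initial values. A minimal sketch of blob-forming dynamics of this kind, under an assumed leaky-integration form (all names and parameter values ours, not the paper's), is:

```python
import numpy as np

def sigmoid(u, gain=10.0, thresh=0.2):
    """Sigmoidal nonlinearity S: saturates near 0 for low u, near 1 for high u."""
    return 1.0 / (1.0 + np.exp(-gain * (u - thresh)))

def ring_kernel(n, width=1.5, strength=1.0):
    """Short-range gaussian excitation G_aa' on a 1-D layer with wrap-around."""
    d = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])
    d = np.minimum(d, n - d)                      # circular distance
    return strength * np.exp(-d**2 / (2 * width**2))

def blob_dynamics(n=40, beta=0.1, rho=0.15, steps=400, dt=0.05, seed=0):
    """Euler integration of the assumed leaky dynamics
        dx_a/dt = -x_a + sum_a' (G_aa' - beta) S(x_a') + rho,
    started from small noisy activity; a local blob of active cells emerges."""
    rng = np.random.default_rng(seed)
    W = ring_kernel(n) - beta                     # excitation minus global inhibition
    x = 0.01 * rng.random(n)                      # spontaneous-activity initial state
    for _ in range(steps):
        x += dt * (-x + W @ sigmoid(x) + rho)
    return sigmoid(x)

s = blob_dynamics()
```

Running this, the layer settles into a contiguous group of near-1 cells surrounded by near-0 cells: the "blob" whose position is selected by the initial noise.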
With given effective connections and small noisy initial values (as a model for spontaneous activity) for the x_a, the activator and collector cell activities evolve on a fast time scale toward an equilibrium distribution in the form of local blobs of active cells (S ≈ 1), with the rest of the cells in layer X or Y inactive (S ≈ 0). The size of the blobs is controlled by the parameters α, β, and σ, whereas their position is determined by the noise input in the case of X and by the profile of the activation in the case of Y. Once the activity in X and Y has settled, the dynamic link variables J_ba are modified by three consecutive substitutions (3.5).
The first step encapsulates the general idea of Hebbian plasticity, though regulated here by the constant ε for the rapid time scale of a single image presentation. After the second and third steps the new connections conform to divergent and convergent sum rules.
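The display of equation 3.5 is lost in this copy, so the following sketch reconstructs only what the text states: a Hebbian growth step with rate ε gated by the similarity matrix T, followed by divergent and convergent sum-rule normalizations (the particular normalization scheme, and all names, are our assumptions):

```python
import numpy as np

def update_links(J, x, y, T, eps=0.02):
    """One dynamic-link update in the spirit of equation 3.5 (exact form assumed).
    Step 1: Hebbian growth of J[b, a] for coactive cells x_a, y_b, gated by T.
    Steps 2, 3: rescale so that the outgoing (divergent) and incoming
    (convergent) link sums of every cell are normalized to 1."""
    J = J + eps * T * np.outer(y, x)                         # step 1: Hebbian term
    J = J / np.maximum(J.sum(axis=0, keepdims=True), 1e-12)  # step 2: divergent sums
    J = J / np.maximum(J.sum(axis=1, keepdims=True), 1e-12)  # step 3: convergent sums
    return J

# Toy example: 4 cells, X-blob at cell 0, Y-blob at cell 1, all features equal.
n = 4
J0 = np.full((n, n), 1.0 / n)      # constant initialization obeying the sum rules
T = 1.0 - np.eye(n)
x = np.array([1.0, 0.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0, 0.0])
J = update_links(J0, x, y, T)
```

After one update the link from the active X-cell to the active Y-cell has grown at the expense of its competitors, while the sum rules keep the total link strength per cell fixed.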
When an image is presented, the full sequence of events is the following. First, the connections J_ba are initialized with a constant value conforming to the sum rules. Then a number of activity-and-modification cycles are carried through. For each of these, the X-activities are initialized with a noise distribution, the Y-activities are reset to 0, and the dynamics of X and Y are run according to equations 3.3 and 3.4 to reach stationary values. Then the dynamic links are updated according to equation 3.5. After typically 50-80 such cycles the dynamic links relax into a stable configuration displaying the underlying symmetry of the actual input image. For a typical result see Figure 4. The network is now ready for permanently recording the symmetry type if it is new, or for recognizing it according to a previously recorded type.

If a link J_ba is active, the activity dynamics of equations 3.3 and 3.4 produces correlated activity in the connected cells: In the stationary state towards the end of each cycle, cells a and b are always active or inactive together. In comparison to the dynamic links, activity correlations have the distinction of graceful degradation: Even if a single link is corrupted, the correlation between the corresponding x and y cells is high if there are strong links in the neighborhood (remember that an activity blob always covers a neighborhood along with a given cell).
4 Recording and Recognizing a Symmetry
The main task necessary for solving the symmetry recognition problem is solved for our model by the unsupervised process of dynamic link mapping described in the last section. For a given symmetric pattern it constructs a temporary representation in the form of a set of active links. This set is the same for all input patterns belonging to the same symmetry class. In order to record a symmetry type it is now simply necessary to create hidden units as permanent representatives for some of the active links (or rather the correlations created by them) and to connect them to appropriate output units. Once a symmetry type has been represented by such a network, its second occurrence can be detected and the system is ready to recognize all patterns of this symmetry type as such. Our recognition network structure is similar to the one used in Sejnowski et al. (1986) and is shown in the upper panel of Figure 3. It consists of three output units C_k, k = 1, 2, 3 (sufficient for three symmetry types) and, connected to each output unit, 6 hidden units. Each hidden unit i has a randomly chosen fixed reference cell a(i) in X and plastic synapses W_ib from all cells b in Y.¹ The output h_i of a hidden unit is driven by a

¹In principle, the number of hidden units per output cell could be one. Recognition is more reliable and faster, however, if the density of reference cells a(i) is large enough so that most of the active blobs in X hit at least one of them.
Figure 3: The complete system. An input pattern (lowest layer) is represented by sets of preactivated neurons in the feature columns (marked here by heavy outline, on the basis of gray levels). Columns are connected by feature-preserving dynamic links (intercolumnar arrows). The dynamic link mechanism creates simultaneous blobs in layers X ("active blob") and Y ("shadow blob") in symmetrically corresponding positions (hatched areas). The symmetry type is then recorded (when it is new) or recognized (when already known) in the classification network (upper part). There are six hidden units per output unit (only four of which are shown). Each hidden unit has one fixed connection to its output unit, one connection from a randomly chosen X-cell, and has plastic connections W_ib from all Y-cells. These latter connections record a symmetry type permanently, by being concentrated into the Y-region lying symmetrically to the location of the X-input of the hidden unit.
coincidence of activity x_a(i) of its reference cell in X and activity within its receptive field W_ib in Y (equation 4.1). In recording mode, hidden units modify their Y-connections W_ib at the end of each activity cycle according to the Hebbian plasticity rule:

    ΔW_ib = Q S(y_b)   if h_i > θ and C_k > 0                      (4.2)
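The recording step can be sketched as follows. The display of equation 4.1 is not legible in this copy, so the coincidence response h_i = x_a(i) · Σ_b W_ib y_b is an assumed form, as are all function and parameter names:

```python
import numpy as np

def hidden_response(x, y, W, ref):
    """Assumed coincidence form of equation 4.1: hidden unit i combines the
    activity of its fixed X reference cell with input from its Y receptive field."""
    return x[ref] * (W @ y)

def record_step(x, y, W, ref, teacher_on, Q=0.02, theta=0.1):
    """Supervised Hebbian recording (equation 4.2): Delta W_ib = Q * S(y_b)
    for hidden units with h_i > theta, applied only while the teacher
    activates this group's output unit (C_k > 0)."""
    if teacher_on:
        h = hidden_response(x, y, W, ref)
        W = W + Q * np.outer(h > theta, y)   # rows of active hidden units grow
    return W

# One hidden unit with reference cell 0; X-blob hits it, Y-blob covers cells 2, 3.
x = np.array([1.0, 0.0, 0.0, 0.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
W = np.full((1, 4), 0.1)
W = record_step(x, y, W, ref=np.array([0]), teacher_on=True)
```

The hidden unit's Y-weights grow only at the positions covered by the shadow blob, which is how the symmetric location of its reference cell becomes permanently recorded.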
Synaptic plasticity is supervised in the sense that only those hidden units modify their connections whose output unit C_k is currently activated by a teacher (the role of the teacher simply being to fixate attention on one group of hidden units during the presentation of one pattern). In this way, a hidden unit whose X connection is hit by a blob learns to associate with it the corresponding blob in the Y plane. The whole process is completed for a symmetry type during one presentation (or in some cases two presentations, see below). In recognition mode, the output units perform a leaky integration of the sum of the activities (equation 4.1) of their group of hidden units. After a number of cycles, the output unit with maximal signal is taken to indicate the class into which the input pattern falls.

5 Simulation Results
Simulations of the model were carried out for input patterns of size 8 × 8. The parameters for the blob formation process in equations 3.3 and 3.4 were adjusted to let the equilibrium blobs cover between 25 and 40% of the layer area; for example, with {σ, ρ, ε, Q, α, β, θ} = {0.3, 0.85, 1.8, 0.02, 0.8, 0.6, 0.125} blobs cover 25% of their layer. As convolution kernel G_aa' we used a gaussian of width 4 and strength 2.1, restricted, however, to a window of 5 × 5 pixels. For almost all input patterns, self-organization of the correct mapping J from X to Y was observed. Figure 4 shows a typical example in some stages of the organization process. The degree of organization can be measured quantitatively by the correlation between corresponding cells, which is shown in Figure 5 for a specific input example. During the first 40-50 activation cycles the correlation builds up and reaches almost the theoretical optimum 1. Thus, during all further cycles symmetrically corresponding points in X and Y are marked by strong correlations, whereas pairs of units in far-from-symmetrical positions would have correlation -1. After learning the specific symmetries from either one or two training examples, the network can generalize almost perfectly to new input patterns of the same symmetry class. Figure 6A shows the classification performance on 200 new examples. There is a clear tradeoff between the reliability of recognition and the required time (in terms of activation
Figure 4: Dynamic link mapping. The network, with layers X (in front) and Y (in the rear) in different activation states, after 15 (A), 50 (B), and 80 (C) activity cycles, all generated for a fixed input pattern of symmetry class A (cf. Fig. 1). The dynamic link mapping process is based on a sequence of blob activations (white circles). Dynamic links J_ba ∈ [0, 1] grow between temporally correlated cells. Only links with J_ba ≥ 0.4 are shown in the figure.

cycles). In principle, one example can supply sufficient information for this performance. However, with our parameter settings two examples gave slightly more reliable results (see Fig. 6A). Our network achieves a recognition reliability of 98%. Its level of reliability is only weakly affected by perturbations of the feature similarity matrix T up to t = 40% (Fig. 6B). This is due to the robustness of the dynamic link mapping mechanism (see Fig. 5), which creates near-to-perfect signal correlations between symmetric points. Since the hidden units are trained by these correlations, the presence of perturbations in the matrix T even during learning does not affect the performance of the system. We have verified numerically that after training the hidden units with t = 40% the performance is virtually the same as in Figure 6B, for example, 93% reliability if the recognition is forced after 100 cycles.

6 Discussion
We have presented here a network that is able to discover a system of relations in individual input patterns and to immediately generalize to
[Figure 5 plot: mean correlation (0.0 to 1.0) versus number n of activation cycles (0 to 120); filled circles: no perturbation (t = 0), open circles: 40% perturbation (t = 0.4).]
Figure 5: Mean correlation between pairs of corresponding cells in layer X and layer Y for a given state of the dynamic links J after n activation cycles (blob pairs). Correlation is computed as

    C(x_a, y_s(a)) = <(x_a − <x_a>)(y_s(a) − <y_s(a)>)> / (Δx_a Δy_s(a)),

with Δx = sqrt(<(x − <x>)²>), and s(a) denoting the cell that lies symmetrically to a. To measure the correlation after n activation cycles, the link state {J_ba} is frozen after n cycles (by setting ε = 0), while the blob activation cycles continue. x_a and y_s(a) are the equilibrium activities of cells a and s(a), respectively, and <·> denotes averaging over many cycles. Possible correlation values range from -1 for perfect anticorrelation to 1 for perfect correlation. What is displayed is the mean of C(x_a, y_s(a)) with respect to all possible positions a, and error bars denote the statistical errors when averaging over 900 cycles. Filled circles: Perfect feature similarity matrix T_ba ∈ {0, 1}. Open circles: All matches T_ba = 1 are replaced by random values T_ba ∈ [1 − t, 1], all nonmatches T_ba = 0 by a random T_ba ∈ [0, t], to mimic the effects of noisy feature information. The correlations are robust against this perturbation.
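The correlation measure of the Figure 5 caption is an ordinary normalized covariance over activation cycles; a minimal sketch (the function name is ours) is:

```python
import numpy as np

def correlation(xs, ys):
    """C(x, y) = <(x - <x>)(y - <y>)> / (dx * dy), with dx = sqrt(<(x - <x>)^2>),
    where <.> averages over activation cycles, as in the Figure 5 caption."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    dx, dy = xs - xs.mean(), ys - ys.mean()
    return (dx * dy).mean() / np.sqrt((dx**2).mean() * (dy**2).mean())
```

Two cells that are always active or inactive together over the cycles give C = 1; perfectly anticorrelated cells give C = -1.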
further examples of the same type. The network is based on dynamic link mapping. The self-organization of dynamic links in our model is extremely fast, requiring much less than 100 update cycles. This is due to the use of local feature information in conjunction with a topology constraint. For simplicity, we have used gray-values of single pixels as
[Figure 6 plots: percentage of correct classifications versus number n of activation cycles (0 to 100). (A) Training with 1 or 2 examples per class. (B) No perturbation (t = 0), 20% perturbation (t = 0.2), and 40% perturbation (t = 0.4).]
Figure 6: Symmetry recognition performance. A total of 200 input patterns are classified according to one of three possible symmetries (cf. Fig. 1). The symmetry types have been recorded previously. The percentage of correct decisions is displayed as a function of the number n of activation cycles until the decision is forced. (A) Unperturbed features, T_ba ∈ {0, 1}, training with k = 1 or k = 2 examples per class, 120/k learning steps according to equation 4.2 for each example. (B) Influence on performance of perturbations in the feature similarity matrix T during recognition: The network can tolerate perturbations of t = 20% or even t = 40%, where t is defined as in Figure 5.
our visual features. In applications to large pixel arrays this would be impractical. The number of dynamic links in the matrix J would have to grow with the fourth power of the linear extent of the input plane. However, if one replaced the gray-level sensitivity of our feature cells by extended receptive fields [e.g., of the Laplace type with a hierarchy of spatial scales, in analogy to the feature jets of Buhmann et al. (1991)] one could cover the input plane with a fairly low-density set of sampling points and correspondingly operate with manageably small X and Y planes.

The main goal of our paper is to make a point about the learning issue, symmetry detection merely playing the role of an example. It may be interesting, though, to briefly discuss symmetry detection in humans and animals. Infants can detect symmetry at the age of four months (Bornstein et al. 1981). Pigeons learn to discriminate symmetry in very few trials (Delius and Nowak 1982), although one may suspect that they already come equipped with the ability to detect symmetry and only have to be conditioned for the appropriate response. Our system may shed new light on the old discussion of nature vs. nurture with respect to the symmetry detection issue: our demonstration that learning time could be extremely short makes it impossible to decide the issue by demonstrating the capability in naive or newborn subjects.

At first sight it is tempting to take our system directly as a model for symmetry detection in primary visual cortex, identifying all of our cell types (X and Y cells, feature cells, and hidden units) with neurons found in cortical hypercolumns in area V1. This view would run, however, into a number of difficulties. One of them is the need, in our model, for long-range connections (intercolumnar links and the W_ib connections from Y cells to hidden units).
With respect to area V1 this requirement creates a conflict, especially in view of the fact that humans are better at detecting symmetry around the vertical midline than around the horizontal (Barlow and Reeves 1979), and callosal connections are absent within V1 except for a narrow seam along the vertical meridian. [This difficulty is mitigated, though, by the fact that symmetry detection in humans relies mainly on a small strip along the symmetry axis of the input pattern, at least in random dot patterns (Julesz 1975).] The problem can be largely avoided by placing our model in a later area in which larger visual angles are spanned by horizontal fibers. Another hint to this effect is the observation that symmetry detection in humans may be based not on distributions of gray levels directly but rather on object shapes reconstructed from shading (Ramachandran 1988). Another difficulty for a direct biological application of our model [which it shares with the one of Sejnowski et al. (1986)] is its lack of invariance with respect to displacement of the symmetry axis, as for instance caused by eye movements during inspection of a pattern. All of these difficulties point to a slightly more complicated model, which would, however, obscure our main point.
Our system is not limited to mirror symmetry. It could equally record and recognize other pattern relations such as simple duplication with or without rotation (or, in a system of only slightly more general form, expansion). Humans, on the other hand, perform much worse on these (Corballis and Roldan 1974). The reason for such differences may be a rather superficial one. Even if the ontogeny of symmetry detection is of the nature we are putting forward here, the system will after some experience be dominated by the hidden units it has acquired. Once these have sufficient density, the dynamic link mechanism is no longer necessary for the recognition of familiar pattern relations [in the same way the correct hidden units in Sejnowski et al. (1986) are activated directly by the known symmetries]. The relative performance on different problem types is then dominated by experience rather than by the nature of the ontogenetic mechanism. This would explain our bias toward vertical mirror symmetry. The heavy reliance of humans on a strip around the symmetry axis mentioned above may point to a mechanism relying on memorized symmetric shapes, such as butterfly patterns and the like, formed on the basis of a general learning mechanism but soon supplanting it by being more rapidly detected.

The structure of our model fits very well the general character of cortical columnar organization [as also employed in von der Malsburg and Buhmann (1992)]. Of central importance to our system is the encoding of significant relations with the help of temporal signal correlations. Candidate correlations of an appropriate nature have been observed in visual cortex (Gray et al. 1989; Eckhorn et al. 1988). The model may thus give support to the idea that correlations play an important functional role in cortical information processing. The central point that we would like to make here refers to the general learning issue.
The origin of knowledge in our mind has puzzled philosophers for centuries. Extreme empiricism is not tenable. Its most concrete formulation, nonparametric estimation, shows that it requires astronomical learning times. The opposite extreme, assuming all knowledge to be present in the brain at birth, is equally untenable, not doing justice to the flexibility of our mind, and just putting the burden of statistical estimation on evolution. The only possible way out of this dilemma is the existence of general principles simple enough to be discovered by evolution and powerful enough to make learning a very fast process. This can only be imagined in a universe with profound regularities. The one we are exploiting here is the wide-spread existence of similarities between simultaneously visible patterns. This regularity is captured in the rather simple structure of our network, enabling it to generalize from single examples of symmetrical patterns, in striking contrast to the system of Sejnowski et al. (1986), which is based on statistical estimation. With small modifications, dynamic link mapping can be used for the purpose of object recognition invariant with respect to translation, rotation and distortion, making the step from the correspondence of simultaneous
patterns to those of consecutive patterns. Again, those transformations could be learned from few examples. The very simple a priori principles incorporated in the learning system that we have presented are feature correspondence, topology, and rapid synaptic plasticity. We feel that it is structural principles of this general style that make natural brains so extremely efficient in extracting significant structure from complex scenes. Although statistical estimation certainly plays a role in animal learning, it can evidently not be its only basis: natural scenes are too complex, and it is impossible to keep track of the whole combinatorics of subpatterns. Potent mechanisms are required to identify significant patterns already within single scenes. Ours may be a candidate.
Acknowledgments

Supported by a grant from the Bundesministerium für Forschung und Technologie (413-5839-01 IN 101 B/9), and a research grant from the Human Frontier Science Program.
References

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. 1985. A learning algorithm for Boltzmann machines. Cogn. Sci. 9, 147-169.
Barlow, H. B., and Reeves, B. C. 1979. The versatility and absolute efficiency of detecting mirror symmetry in random dot displays. Vision Res. 19, 783-793.
Bienenstock, E., and von der Malsburg, C. 1987. A neural network for invariant pattern recognition. Europhys. Lett. 4, 121-126.
Bornstein, M. H., Ferdinandsen, K., and Gross, C. G. 1981. Perception of symmetry in infancy. Dev. Psychol. 17, 82-86.
Buhmann, J., Lange, J., and von der Malsburg, C. 1989. Distortion invariant object recognition by matching hierarchically labeled graphs. In IJCNN International Conference on Neural Networks, Washington, pp. 155-159. IEEE, New York.
Buhmann, J., Lange, J., von der Malsburg, C., Vorbrüggen, J. C., and Würtz, R. P. 1991. Object recognition in the dynamic link architecture: Parallel implementation on a transputer network. In Neural Networks: A Dynamical Systems Approach to Machine Intelligence, B. Kosko, ed., pp. 121-160. Prentice Hall, New York.
Coolen, A. C. C., and Kuijk, F. W. 1989. A learning mechanism for invariant pattern recognition in neural networks. Neural Networks 2, 495.
Corballis, M. C., and Roldan, C. E. 1974. On the perception of symmetrical and repeated patterns. Percept. Psychophys. 16, 136-142.
Cover, T. M., and Hart, P. E. 1967. Nearest neighbor pattern classification. IEEE Transact. Inform. Theory IT-13, 21-27.
Crick, F. 1984. Function of the thalamic reticular complex: The searchlight hypothesis. Proc. Natl. Acad. Sci. U.S.A. 81, 4586-4590.
Delius, J. D., and Nowak, B. 1982. Visual symmetry recognition by pigeons. Psychol. Res. 44, 199-212.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., and Reitboeck, H. 1988. Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern. 60, 121.
Fix, E., and Hodges, J. L. 1951. Discriminatory analysis, non-parametric discrimination. Tech. Rep., USAF School of Aviation Medicine, Project 21-49-004, Rept. 4.
Gray, C. M., König, P., Engel, A. K., and Singer, W. 1989. Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature (London) 338, 334-337.
Julesz, B. 1975. Experiments in the visual perception of texture. Sci. Am. 232(4).
Kohonen, T., Barna, G., and Chrisley, R. 1988. Statistical pattern recognition with neural networks: Benchmarking studies. In Proceedings of the IEEE ICNN, San Diego.
Lades, M., Vorbrüggen, J. C., Buhmann, J., Lange, J., von der Malsburg, C., Würtz, R. P., and Konen, W. 1993. Distortion invariant object recognition in the dynamic link architecture. IEEE Transact. Computers 42, 300-311.
Ramachandran, V. 1988. Perceiving shape from shading. Sci. Am. 259, 76-83.
Reilly, D. L., Cooper, L. N., and Elbaum, C. 1982. A neural model for category learning. Biol. Cybern. 45, 35-41.
Rosenblatt, F. 1962. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, DC.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by back-propagating errors. Nature (London) 323, 533-536.
Schneider, W. 1986. Anwendung der Korrelationstheorie der Hirnfunktion auf das akustische Figur-Hintergrund-Problem (Cocktailparty-Effekt). Ph.D. thesis, Universität Göttingen, Göttingen, Germany.
Sejnowski, T. J., Kienker, P. K., and Hinton, G. E. 1986.
Learning symmetry groups with hidden units: Beyond the perceptron. Physica 22D,260-275. von der Malsburg, C. 1981. The correlation theory of brain function. Internal report, 81-2,Max-Planck-Institut fur Biophysikalische Chemie, Postfach 2841,3400 Gottingen, Germany. von der Malsburg, C. 1988. Pattern recognition by labeled graph matching. Neural Networks 1, 141-148. von der Malsburg, C., and Buhmann, J. 1992. Sensory segmentation with coupled neural oscillators. Biol. Cybern. 67, 233-242. Willshaw, D. J., and von der Malsburg, C. 1976. How patterned neural connections can be set up by self-organization. Proc. R. SOC.London B194,431-445.
Received 1 July 1992; accepted 12 February 1993.
Communicated by William W. Lytton
Neural Network Modeling of Memory Deterioration in Alzheimer's Disease
D. Horn
School of Physics and Astronomy, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel
E. Ruppin Department of Computer Science, Raymond and Beverly Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel
M. Usher
CNS Program, Division of Biology 216-76, Caltech, Pasadena, CA 91125 USA
M. Herrmann
Sektion Informatik, Universität Leipzig, PSF 920, D-O-7010 Leipzig, Germany
The clinical course of Alzheimer's disease (AD) is generally characterized by progressive gradual deterioration, although large clinical variability exists. Motivated by the recent quantitative reports of synaptic changes in AD, we use a neural network model to investigate how the interplay between synaptic deletion and compensation determines the pattern of memory deterioration, a clinical hallmark of AD. Within the model we show that the deterioration of memory retrieval due to synaptic deletion can be much delayed by multiplying all the remaining synaptic weights by a common factor, which keeps the average input to each neuron at the same level. This parallels the experimental observation that the total synaptic area per unit volume (TSA) is initially preserved when synaptic deletion occurs. By using different dependencies of the compensatory factor on the amount of synaptic deletion one can define various compensation strategies, which can account for the observed variation in the severity and progression rate of AD.

1 Introduction

Alzheimer's disease (AD) is the major degenerative disease of the brain, responsible for a progressive deterioration of the patient's cognitive and motor function, with a grave prognosis (Adams and Victor 1989). Its clinical course is usually characterized by gradual decay, although both slow and rapidly progressive forms have been reported, exhibiting a large
Neural Computation 5, 736-749 (1993) © 1993 Massachusetts Institute of Technology
Neural Network Modeling in AD
737
variation in the rate of AD progression (Drachman et al. 1990). While remarkable progress has been gained in the investigation of neurochemical processes accompanying AD, their role in neural degeneration, the main pathological feature of AD, is yet unclear (Selkoe 1987; Kosik 1991). This work is motivated by recent investigations studying in detail the neurodegenerative changes accompanying AD, on a neuroanatomical level. Following the paradigm that cognitive processes can be accounted for on the neural level, we examine the effect of these neurodegenerative changes within the context of a neural network model. This allows us to obtain a schematic understanding of the clinical course of AD. Neuroanatomical investigations in AD patients demonstrate a considerable decrease in the synapse to neuron ratio, due to synaptic deletion (Davies et al. 1987; Bertoni-Freddari et al. 1990). Synaptic compensation, manifested by an increase of the synaptic size, was found to take place concomitantly, reflecting a functional compensatory increase of synaptic efficacy at the initial stages of the disease (Bertoni-Freddari et al. 1988, 1990; DeKosky and Scheff 1990). The combined outcome of these counteracting synaptic degenerative and compensatory processes can be evaluated by measuring the total synaptic area per unit volume (TSA), which was shown to correlate with the cognitive function of AD patients (DeKosky and Scheff 1990). Our model, presented in Section 2, serves as a framework for examining the interplay of synaptic deletion and compensation. This attractor neural network (ANN) is not supposed to represent any specific neuronal tissue, yet we believe that our results are relevant to a large class of neural systems. Deletion is carried out stochastically by removing the fraction d of all synaptic weights. Compensation is modeled by multiplying all remaining synaptic weights by a common factor c. The TSA value is proportional to c(1 - d ) . 
Varying c as a function of d specifies a compensation strategy. We assume that the network's failure rate, measured by the fraction of memories erroneously retrieved, represents the degree of "cognitive deficit" in clinical observations. Reviewing the pertaining pathological and clinical data, we show in Section 3 how our model can account for the variability observed in the clinical course of AD. Our results are further discussed in Section 4.

2 The Model
Concentrating on memory degradation, a clinical hallmark of AD (Adams and Victor 1989), we use as our theoretical framework a neural network model of associative memory. Our model is based on the biologically motivated variant of Hopfield's model (1982), proposed by Tsodyks and Feigelman (1988). In an ANN, the stored memories are attractors of the network's dynamics: starting from an initial condition sufficiently similar to one of the memory
patterns, the network flows to a stable state identical with that memory. The appeal of attractors, as corresponding to our intuitive notion of the persistence of cognitive concepts along some temporal span, has been fortified by numerous studies testifying to the applicability of ANNs as models of the human memory [for a review see Amit (1989)], and is also supported by biological findings of delayed, poststimulus, sustained activity (Fuster and Jervey 1982; Miyashita and Chang 1988). All N neurons in the network have a uniform positive threshold T. Each neuron is described by a binary variable S = {1, 0} denoting an active (firing) or passive (quiescent) state, respectively. M = αN distributed memory patterns ξ^μ are stored in the network. The elements of each memory pattern are chosen to be 1 (0) with probability p (1 − p) respectively, with p << 1. The weights of the synaptic connections are
w_{ij} = \frac{1}{N} \sum_{\mu=1}^{M} (\xi_i^\mu - p)(\xi_j^\mu - p)    (2.1)
The updating rule for neuron i at time t is given by

S_i(t+1) = \theta\left( \sum_{j} w_{ij} S_j(t) - T \right)    (2.2)
where θ is the step function. The performance of the network is measured by the activities of the memories, as defined by the overlaps m^μ,

m^\mu(t) = \frac{1}{p(1-p)N} \sum_{i=1}^{N} (\xi_i^\mu - p)\, S_i(t)    (2.3)
As shown in the Appendix, there exists an optimal value of the threshold T = p(1 − p)(1 − 2p)/2, which ensures the best performance of the network. Starting with such a memory model we introduce synaptic deletion by randomly deleting some of the incoming synapses of every neuron, leaving each neuron with l = (1 − d)N input connections, where d < 1 is the deletion factor. Synaptic compensation is modeled by multiplying the weights of the remaining synaptic connections by a uniform compensation factor c > 1. This changes the dynamics of the system to

S_i(t+1) = \theta\left( c \sum_{j \in D_i} w_{ij} S_j(t) - T \right)    (2.4)
where D_i denotes a random set of indices corresponding to neurons to which the ith neuron is connected, and |D_i|/N = 1 − d ≤ 1. T remains the same value as before. In the network's "premorbid" state, the memories have maximal stability, achieved by choosing the optimal threshold T that maximizes the increase of the overlap (say m¹), as shown in the Appendix. When the network is initialized with an input pattern that is a corrupted version
Figure 1: The distribution of the postsynaptic potential (p = 0.1, α = 0.05). Solid curve: Initial state: two gaussian distributions peaked at −p²(1 − p) and p(1 − p)². The optimal threshold T = p(1 − p)(1 − 2p)/2 lies in the middle between the two gaussian mean values. Dashed curve: After deletion (d = 0.25), the new peaks of the postsynaptic potential are no longer equidistant from the threshold T. Dot-dashed curve: The OPC strategy restores the initial mean values of the postsynaptic potential (d = 0.75).
of one of the stored memory patterns (e.g., ξ¹), it will flow dynamically into the attractor given by this memory. To obtain an intuitive notion of the network's behavior when synaptic deletion and compensation are incorporated consider Figure 1. The neurons that stand for firing neurons in the stored memory, and the neurons that stand for quiescent neurons in the stored memory, have distinct postsynaptic potential distributions (the solid curves in Fig. 1). When synaptic deletion takes place, the mean values of the neurons' postsynaptic potential change, and the threshold is no longer optimal (see dashed curves in Fig. 1). Multiplying the weights of the remaining synaptic connections by an optimal performance compensation (OPC) factor c = 1/(1 − d) restores the original mean values of the postsynaptic potential and the optimality of the threshold (dot-dashed curves in Fig. 1). The accompanying increase in the variance of the postsynaptic potential, which is 1/(1 − d) times larger than the original one, leads, however, to performance deterioration. This is further elucidated in the Appendix.
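As a concrete illustration, the model of Section 2 (eqs. 2.1-2.4) can be simulated in a few lines. The sketch below is ours, not the authors' code; the network size, number of patterns, random seed, and helper names (`overlap`, `run`) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, p = 2000, 10, 0.1                # neurons, stored patterns, coding level (illustrative)
T = p * (1 - p) * (1 - 2 * p) / 2      # optimal premorbid threshold

xi = (rng.random((M, N)) < p).astype(float)   # memory patterns xi^mu with elements in {0, 1}
w = (xi - p).T @ (xi - p) / N                 # synaptic weights, eq. 2.1
np.fill_diagonal(w, 0.0)                      # no self-connections

def overlap(S, mu):
    """Overlap m^mu of state S with pattern mu (eq. 2.3)."""
    return (xi[mu] - p) @ S / (p * (1 - p) * N)

def run(S, d=0.0, c=1.0, steps=10):
    """Synchronous updates with deletion fraction d and compensation factor c (eq. 2.4)."""
    D = rng.random((N, N)) < (1 - d)          # D_i: surviving incoming synapses of each neuron
    for _ in range(steps):
        S = ((c * (w * D)) @ S - T > 0).astype(float)
    return S

# Retrieval from a corrupted version of pattern 0 (each bit flipped with prob 0.1, m(0) ~ 0.8)
S0 = np.where(rng.random(N) < 0.1, 1 - xi[0], xi[0])
S_intact = run(S0)                            # no deletion: flows into the attractor
S_deleted = run(S0, d=0.9, c=1.0)             # heavy uncompensated deletion: retrieval fails
print(overlap(S0, 0), overlap(S_intact, 0), overlap(S_deleted, 0))
```

With these parameters the intact network retrieves the memory (final overlap near 1), while at d = 0.9 without compensation the mean input falls below threshold and the activity dies out.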
Figure 2: Performance of a network with fixed k compensation. Starting from an initial state that is a corrupted version [m¹(0) = 0.8] of a stored memory pattern, we define performance as the percentage of cases in which the network converged to the correct memory. The simulation parameters are N = 800 neurons, α = 0.05, and p = 0.1. The curves represent (from left to right) the performance of fixed strategies with increasing k values, for k = 0, 0.25, 0.375, 0.5, 0.625, 0.75, 1. The horizontal dotted lines represent performance levels of 25 and 75%.
We can interpolate between the case of deletion without compensation and the OPC within a class of compensatory strategies, defined by

c = \frac{1}{(1-d)^k}    (2.5)
with the parameter 0 ≤ k ≤ 1. All the fixed k strategies, examined via simulations measuring the performance of the network at various deletion and compensation levels, display a similar sharp transition from the memory-retrieval phase to a nonretrieval phase, as shown in Figure 2. Varying the compensation magnitude k merely shifts the location of the transition region. This sharp transition is similar to that reported previously in cases of deletion without compensation in other models (Canning and Gardner 1988; Koscielny-Bunde 1990).
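A minimal numerical sketch of this strategy family, assuming the power-law interpolation c = (1 − d)^(−k) for equation 2.5 (k = 0 gives pure deletion, k = 1 gives the OPC factor c = 1/(1 − d)); the function name is ours:

```python
def compensation(d, k):
    """Fixed-k compensation factor c, assuming c = (1 - d)**(-k) (eq. 2.5)."""
    return (1 - d) ** (-k)

# At half deletion (d = 0.5): k = 0 leaves c = 1 (no compensation), k = 1 doubles
# every surviving weight; TSA is proportional to c(1 - d), so only k = 1 preserves it.
for k in (0.0, 0.5, 1.0):
    c = compensation(0.5, k)
    print(f"k={k}: c={c:.3f}, c(1-d)={c * 0.5:.3f}")
```

Note that c(1 − d) = (1 − d)^(1−k), so a fixed k < 1 lets the TSA decay smoothly while a fixed k = 1 holds it constant, consistent with the OPC description above.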
Figure 3: The critical transition range in the (k, d) plane. The solid curves represent performance levels of 75 and 25%, derived from Figure 2. The straight lines describe the variations employed in the two variable compensation strategies presented in Figure 4.
Figure 3 describes the transition region as a map in the (k, d) plane. The performance levels read off Figure 2 delineate the domain over which deterioration occurs. Staying close to the upper boundary of this domain defines a compensation strategy that enables the system to maintain its performance with a much smaller amount of synaptic strengthening than that required by the OPC strategy. In the face of restricted compensatory resources, such an optimal resource compensation (ORC) strategy could be of vital importance. The essence of such an ORC strategy is that k is varied as synaptic deletion progresses, in order to retain maximal performance with minimal resource allocation. In Figure 4, we present the performance of two variable k compensation strategies, which we propose to view as expressions of (albeit unsuccessful) attempts at maintaining an ORC. These examples, indicated in Figure 3, include a "gradually decreasing" strategy defined by a linear variation of k starting at k = 0.3, and the "plateau" strategy defined by the variation k = d. The analogs of these strategies can be found in clinical observations, as shown in the next section where we review the biological and clinical evidence relevant to our model.
Figure 4: Performance of a network with gradually decreasing (dotted curve) and plateau (dashed curve) compensation strategies.

3 Clinical Motivation and Implications
As mentioned in the introduction, while synaptic degeneration occurs, the TSA stays constant in some cortical layers at the initial stages of AD. Qualitatively similar synaptic changes have been observed during normal physiological aging, but with significantly lower deletion (Bertoni-Freddari et al. 1988, 1990). Hence a plausible general scenario seems to involve some initial period of OPC. As AD progresses, synaptic compensation no longer succeeds in maintaining the TSA (Bertoni-Freddari et al. 1990; DeKosky and Scheff 1990). In advanced AD cases, severe compensatory dysfunction has been observed (Buell and Coleman 1979; Flood and Coleman 1986; DeKosky and Scheff 1990). Young AD patients are likely to have high compensation capacities, and therefore can maintain an OPC strategy (k = 1 in Fig. 2) throughout the course of their disease. This will then lead to a rapid deterioration when the reserve of synaptic connections has been depleted. Indeed, young AD patients have been reported to have a rapid clinical progression (Heston et al. 1981; Heyman et al. 1983), accompanied by severe neuronal and synaptic loss (Hansen et al. 1988). A similar clinical pattern of rapid memory decline, already manifested with less severe neuroanatomical pathology, was found in very old patients (Huff et al. 1987).
We propose that in these old patients, the rapid clinical decline results from the lack of compensatory capacities (k = 0 in Fig. 2), possibly of the kind observed by Buell and Coleman (1979) and Flood and Coleman (1986). Rapid cognitive decline characterizes a minority of AD patients. Most patients show a continuous gradual pattern of cognitive decline (Adams and Victor 1989; Katzman 1986; Katzman et al. 1988), taking place along a broad span of synaptic deletion (DeKosky and Scheff 1990). As shown in Figure 2, this performance decline cannot be accounted for by any network employing fixed k compensation. Variable compensation, such as that defined by the gradually decreasing strategy, is needed to explain the memory decline observed in the majority of AD patients, as shown in Figure 4. The clinical state of some AD patients remains stable at mild to moderate levels for several years before finally rapidly decreasing (Cummings and Benson 1983; Katzman 1985; Botwinick et al. 1986). This can be accounted for by a "plateau" strategy whose performance, shown in Figure 4, stays at an approximately constant level over a large domain of deletion. Synaptic deletion and compensatory mechanisms play a major role also in the pathogenesis of Parkinson's disease (Zigmond et al. 1990; Calne and Zigmond 1991). The significant incidence of AD patients having accompanying extrapyramidal parkinsonian signs (Mayeux et al. 1985; Stern et al. 1990) naturally raises the possibility that such patients may have a decreased synaptic compensatory potential in general (Horn and Ruppin 1992). The cognitive deterioration of these AD patients is faster than that of AD patients without extrapyramidal signs. This fits well with our proposal that severely deteriorated synaptic compensation capacity leads to an accelerated rate of cognitive decline in AD patients. This issue remains inconclusive, however, because the PD-AD combination may be a specific syndrome on its own.

4 Discussion
In accordance with the findings that neuronal loss in AD is less than 10% even at advanced stages (Katzman 1986), and that the synapse to neuron ratio is significantly decreased (Davies et al. 1987; Bertoni-Freddari et al. 1990), we have concentrated on studying the role of the synaptic changes. Simulations we have performed incorporating neuronal loss have shown similar results to those presented above. We conclude therefore that the important factors are indeed the number of synapses retained and the compensation strategy employed, whose interplay may lead to various patterns of performance decline. As any current neural model of human cognitive phenomena, our model necessarily involves many simplifications. The formal neurons of the Tsodyks-Feigelman (TF) model are obviously a very gross simplification of biological neurons. As in
most Hopfield-like ANNs, the network has no spatially specified architecture. For clarity of exposition of our main ideas, we have assumed that all compensation strategies are applied uniformly to all retained synapses. Our analysis also holds for nonuniform compensation, that is, when each remaining synaptic weight is multiplied by a random variable with mean value c and variance σ², since the same averages of the postsynaptic potentials are obtained (see Fig. 1 and the Appendix). Obviously, if the variance is too large, then no compensation strategy can be conceived of any more. Motivated by the biological evidence testifying to the sparsity of neural firing (Abeles et al. 1990), we have assumed a relatively small fraction p of firing neurons. Simulations performed with higher p values (e.g., 0.2) indicate that the results remain qualitatively the same. However, it should be noted that as p is increased the approximation of the network's overlap dynamics presented in the Appendix becomes less and less accurate. The variable compensation strategies that we have discussed rely on the fact that there is some span in the (k, d) plane over which deterioration takes place, as shown in Figure 3. As N is increased, the width of the domain over which deterioration occurs keeps getting narrower, thus limiting the possibilities of maneuvering between deletion and compensation. Hence, one may claim that our conclusions, which are based on simulations of small-scale networks, do not hold for the brain. One possible answer to this problem is that there may exist important modules in the brain whose function depends on the correct performance of just some thousands of neurons (Eccles 1981). For large cortical modules, this objection may be resolved by considering the effect of noise present in the brain. To account for the latter, any realistic paradigm of memory recall should entail the recognition of a spectrum of noisy inputs presented to the network.
Figure 5 displays the performance of the network in the (k, d) plane obtained via simulations with two distinct initial overlap values [m¹(0) = 0.8 and m¹(0) = 0.95], together with the theoretical results for the infinite N limit. These results show that even in this limit, the corresponding performance curve always retains a finite width as long as the network processes input patterns with a broad range of initial overlaps. Consequently, the realization of variable compensatory strategies may indeed be feasible in the brain.

The decline in the network's performance resulting from synaptic deletion is coupled with a decrease in the network's overall activity. This observation gives rise to the possibility that although being defined "globally" as "strategies," synaptic compensation may take place via local feedback mechanisms. Accordingly, the decreased firing rate of a neuron being gradually deprived of its synaptic inputs may enhance the activity of cellular processes strengthening its remaining synapses. This scenario seems to lead to fixed OPC compensation in a rather straightforward manner, but as synaptic deletion may be nonhomogeneous, the effects of the resulting spatially nonuniform compensatory changes should be further investigated. The nonvanishing width of the (k, d) plane transition range shown above is essential for the feasibility of an ORC strategy, so that local mechanisms can "trace" the decreasing performance and "counteract" it before the performance collapses entirely.

Finally let us comment on possible examinations and applications of our model. An ideal experiment would involve a series of consecutive measurements of synaptic strength and cognitive abilities. In light of obvious difficulties concerning such tests, we may have to resort to comparing biopsies and autopsies, as in DeKosky and Scheff (1990), preferably on the same patients. Our model demonstrates the importance of maintaining the TSA for the preservation of memory capacity and, therefore, mental ability of AD patients. This may suggest that future therapeutic efforts in AD should include an attempt to mobilize compensatory mechanisms facilitating the regeneration of synaptic connectivity.

Figure 5: The critical transition range in the (k, d) plane. The solid curves represent performance levels of 75 and 25%, with initial overlap m¹(0) = 0.8 (identical to Fig. 2). The dash-dotted curves represent performance levels of 75 and 25%, for initial overlap m¹(0) = 0.95. The dotted curves represent the theoretical results that follow from the analysis presented in the Appendix, delineating the estimates of when the corresponding basins of attraction cease to exist in the infinite N limit. These curves lie close to the 25% lines of the simulations.
Appendix

Qualitative features of our model can be derived from a simple analysis of the first iteration step. Starting with a state that is close to ξ¹ with overlap m¹(0), we wish to find whether the network flows into the correct memory. Using the dynamics defined in the text we find for t = 1

S_i(1) = \theta\left[ c(1-d)(\xi_i^1 - p)\, p(1-p)\, m^1(0) + \mathcal{N} - T \right]    (A.1)
where we have separated the signal from the noise term \mathcal{N}. The latter has zero average, ⟨\mathcal{N}⟩ = 0, and variance ⟨\mathcal{N}²⟩ = c²(1 − d) p²(1 − p)² α s(0), where s(0) = P[S_i(0) = 1] = 1 − p − m¹(0) + 2p m¹(0). In view of the gaussian noise term we write the probability in terms of an error function,

P[S_i(1) = 1 \mid \xi_i^1] = \mathrm{erf}\left[ \frac{c(1-d)(\xi_i^1 - p)\, p(1-p)\, m^1(0) - T}{\sqrt{\alpha c^2 (1-d)\, s(0)\, p^2 (1-p)^2}} \right]    (A.2)

This results in the following expression for the first iteration:

m^1(1) = \frac{1}{p(1-p)} \Big( (1-p)\, P(\xi_i^1 = 1)\, P[S_i(1) = 1 \mid \xi_i^1 = 1] - p\, P(\xi_i^1 = 0)\, P[S_i(1) = 1 \mid \xi_i^1 = 0] \Big)
       = P[S_i(1) = 1 \mid \xi_i^1 = 1] - P[S_i(1) = 1 \mid \xi_i^1 = 0]
       = \mathrm{erf}\left[ \frac{(1-p)\, m^1(0)\, p(1-p)\, c(1-d) - T}{\sqrt{\alpha c^2 (1-d)\, s(0)\, p^2 (1-p)^2}} \right] - \mathrm{erf}\left[ \frac{-p\, m^1(0)\, p(1-p)\, c(1-d) - T}{\sqrt{\alpha c^2 (1-d)\, s(0)\, p^2 (1-p)^2}} \right]    (A.3)
In the limit m¹(0) → 1 one finds the maximal value of equation A.3 to be obtained for the following choice of the optimal threshold:

T^* = c(1-d)\, p(1-p)(1-2p)/2    (A.4)
For c = 1 and d = 0 this coincides with the choice we have made. Moreover, as long as

c = \frac{1}{1-d}    (A.5)

which was defined as the OPC strategy, T remains optimal. This fact was expressed graphically in Figure 1. The two gaussian distributions in this figure correspond to the two terms in equation A.3.
In the simulations we have looked for the cases in which the system converged onto the correct fixed points. This involves iterating the equations of motion, which is in general different from iterating expressions like equation A.3, because of possible correlations between the different time steps. Nonetheless we may think of the iteration of equation A.3 (replacing the labeling 0 and 1 by n − 1 and n) as a rough estimate for the strongly diluted infinite system (Evans 1989). Starting out with different values of m¹(0) we find the dotted curves in Figure 5, which show the borderlines between convergence and nonconvergence to the correct memory. An alternative to this derivation is to use the replica symmetry assumption. We have carried out such an analysis (Herrmann et al. 1992). The results are similar, though not identical, to the ones reported above.
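As a rough numerical check, the map of equation A.3 can be iterated directly. The sketch below is an illustration, not the authors' code: it reads the Appendix's error-function shorthand as the gaussian cumulative probability, keeps the premorbid threshold fixed, and uses illustrative parameters p = 0.1, α = 0.05 and our own function names:

```python
import math

p, alpha = 0.1, 0.05                       # coding level and memory load (illustrative)

def Phi(x):
    """Gaussian cumulative distribution, standing in for the erf shorthand of eq. A.2."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def next_overlap(m, c, d):
    """One iteration of eq. A.3, with the labels 0, 1 replaced by n-1, n."""
    T = p * (1 - p) * (1 - 2 * p) / 2      # premorbid threshold, kept fixed
    s = 1 - p - m + 2 * p * m              # s(n-1) = P[S_i(n-1) = 1]
    sigma = math.sqrt(alpha * c * c * (1 - d) * s) * p * (1 - p)
    base = m * p * (1 - p) * c * (1 - d)   # signal amplitude per unit (xi - p)
    return Phi(((1 - p) * base - T) / sigma) - Phi((-p * base - T) / sigma)

def final_overlap(m0, c=1.0, d=0.0, steps=50):
    """Iterate the overlap map to (approximate) convergence."""
    m = m0
    for _ in range(steps):
        m = next_overlap(m, c, d)
    return m

print(final_overlap(0.8))                  # intact network: retrieval, m near 1
print(final_overlap(0.8, c=1.0, d=0.5))    # half deletion, no compensation: m collapses
print(final_overlap(0.8, c=2.0, d=0.5))    # OPC c = 1/(1-d): retrieval restored
```

Under these assumptions the map reproduces the qualitative picture of Figures 1 and 2: uncompensated deletion pushes the mean input below threshold and the overlap collapses, while the OPC factor restores retrieval at the cost of a larger noise variance.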
References

Abeles, M., Vaadia, E., and Bergman, H. 1990. Firing patterns of single units in the prefrontal cortex and neural network models. Network 1, 13-25.
Adams, R. D., and Victor, M. 1989. Principles of Neurology. McGraw-Hill, New York.
Amit, D. J. 1989. Modeling Brain Function: The World of Attractor Neural Networks. Cambridge University Press, Cambridge.
Bertoni-Freddari, C., Meier-Ruge, W., and Ulrich, J. 1988. Quantitative morphology of synaptic plasticity in the aging brain. Scanning Microsc. 2, 1027-1034.
Bertoni-Freddari, C., Fattoretti, P., Casoli, T., Meier-Ruge, W., and Ulrich, J. 1990. Morphological adaptive response of the synaptic junctional zones in the human dentate gyrus during aging and Alzheimer's disease. Brain Res. 517, 69-75.
Botwinick, J., Storandt, M., and Berg, L. 1986. A longitudinal behavioral study of senile dementia of the Alzheimer type. Arch. Neurol. 43, 1124-1127.
Buell, S. J., and Coleman, P. D. 1979. Dendritic growth in the aged human brain and failure of growth in senile dementia. Science 206, 854-856.
Calne, D. B., and Zigmond, M. J. 1991. Compensatory mechanisms in degenerative neurologic diseases. Arch. Neurol. 48, 361-363.
Canning, A., and Gardner, E. 1988. Partially connected models of neural networks. J. Phys. A: Math. Gen. 21, 3275-3284.
Cummings, J. L., and Benson, D. F. 1983. Dementia: A Clinical Approach. Butterworths, London.
Davies, C. A., Mann, D. M. A., Sumpter, P. Q., and Yates, P. O. 1987. A quantitative morphometric analysis of the neuronal and synaptic content of frontal and temporal cortex in patients with Alzheimer's disease. J. Neurol. Sci. 78, 151-164.
DeKosky, S. T., and Scheff, S. W. 1990. Synapse loss in frontal cortex biopsies in Alzheimer's disease: Correlation with cognitive severity. Ann. Neurol. 27(5), 457-464.
Drachman, D. A., O'Donnell, B. F., Lew, R. A., and Swearer, J. M. 1990. The prognosis in Alzheimer's disease. Arch. Neurol. 47, 851-856.
Eccles, J. C. 1981. The modular operation of the cerebral neocortex considered as the material basis of mental events. Neuroscience 6, 1839-1855.
Evans, M. R. 1989. Random dilution in a neural network for biased patterns. J. Phys. A: Math. Gen. 22, 2103-2118.
Flood, D. G., and Coleman, P. D. 1986. Failed compensatory dendritic growth as a pathophysiological process in Alzheimer's disease. Can. J. Neurol. Sci. 13, 475-479.
Fuster, J. M., and Jervey, J. P. 1982. Neuronal firing in the inferotemporal cortex of the monkey in a visual memory task. J. Neurosci. 2(3), 361-375.
Hansen, L. A., DeTeresa, R., Davies, P., and Terry, R. D. 1988. Neocortical morphometry, lesion counts, and choline acetyltransferase levels in the age spectrum of Alzheimer's disease. Neurology 38, 48-54.
Herrmann, M., Horn, D., Ruppin, E., and Usher, M. 1992. Variability in the pathogenesis of Alzheimer's disease: Analytical results. Proc. ICANN'92, September, Brighton, UK (in press).
Heston, L. L., Mastri, A. R., Anderson, V. E., and White, J. 1981. Dementia of the Alzheimer type: Clinical genetics, natural history, and associated conditions. Arch. Gen. Psychiat. 38, 1085-1090.
Heyman, A., Wilkinson, W. E., Hurwitz, P. J., Schmechel, D., Sigmon, A. H., Weinberg, T., Helms, M. J., and Swift, M. 1983. Alzheimer's disease: Genetic aspects and associated clinical disorders. Ann. Neurol. 14(5), 507-515.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Horn, D., and Ruppin, E. 1992. Extra-pyramidal symptoms in Alzheimer's disease: A hypothesis. Med. Hypotheses 39(4), 316-318.
Huff, F. J., Growdon, J. H., Corkin, S., and Rosen, T. J. 1987. Age of onset and rate of progression of Alzheimer's disease. J. Am. Geriatr. Soc. 35, 27-30.
Jansen, K. L. R., Faull, R. L. M., Dragunow, M., and Synek, B. L. 1990. Alzheimer's disease: Changes in hippocampal N-methyl-D-aspartate, quisqualate, neurotensin, adenosine, benzodiazepine, serotonin and opioid receptors: An autoradiographic study. Neuroscience 39(3), 613-627.
Katzman, R. 1985. Clinical presentation of the course of Alzheimer's disease: The atypical patient. Interdiscipl. Topics Gerontol. 20, 12-18.
Katzman, R. 1986. Alzheimer's disease. N. Engl. J. Med. 314(15), 964-973.
Katzman, R., et al. 1988. Comparison of rate of annual change of mental status score in four independent studies of patients with Alzheimer's disease. Ann. Neurol. 24(3), 384-389.
Koscielny-Bunde, E. 1990. Effect of damage in neural networks. J. Statist. Phys. 58, 1257-1266.
Kosik, K. S. 1991. Alzheimer's plaques and tangles: Advances in both fronts. TINS 14, 218-219.
Mayeux, R., Stern, Y., and Spanton, S. 1985. Heterogeneity in dementia of the Alzheimer type: Evidence of subgroups. Neurology 35, 453-461.
Miyashita, Y., and Chang, H. S. 1988. Neuronal correlate of pictorial short-term memory in the primate temporal cortex. Nature (London) 331, 68.
Selkoe, D. J. 1987. Deciphering Alzheimer's disease: The pace quickens. TINS 10, 181-184.
Stern, Y., Hesdorffer, D., Sano, M., and Mayeux, R. 1990. Measurement and prediction of functional capacity in Alzheimer's disease. Neurology 40, 8-14.
Tsodyks, M. V., and Feigelman, M. V. 1988. The enhanced storage capacity in neural networks with low activity level. Europhys. Lett. 6, 101-105.
Zigmond, M. J., Abercrombie, E. D., Berger, T. W., Grace, A. A., and Stricker, E. M. 1990. Compensations after lesions of central dopaminergic neurons: Some clinical and basic implications. TINS 13, 290.

Received 1 July 1992; accepted 14 January 1993.
Communicated by Geoffrey Hinton
Supervised Factorial Learning

A. Norman Redlich
The Rockefeller University, 1230 York Avenue, New York, NY 10021 USA
Factorial learning, finding a statistically independent representation of a sensory "image" (a factorial code), is applied here to solve multilayer supervised learning problems that have traditionally required backpropagation. This lends support to Barlow's argument for factorial sensory processing, by demonstrating how it can solve actual pattern recognition problems. Two techniques for supervised factorial learning are explored, one of which gives a novel distributed solution requiring only positive examples. Also, a new nonlinear technique for factorial learning is introduced that uses neural networks based on almost reversible cellular automata. Due to the special functional connectivity of these networks, which resemble some biological microcircuits, learning requires only simple local algorithms. Also, supervised factorial learning is shown to be a viable alternative to backpropagation. One significant advantage is the existence of a measure for the performance of intermediate learning stages.

Neural Computation 5, 750-766 (1993)
© 1993 Massachusetts Institute of Technology

1 Introduction

Inhibition of neurons by each other, both laterally and temporally, is ubiquitous in the brain, and shows up most obviously in sensory processing as sensitivity to both spatial and temporal contrast. For example, in the retina the center-surround mechanism and a similar temporal differentiating mechanism produce outputs that are most prominent at locations and times where luminance is varying. Another way to characterize these outputs is that they are signaling only the least predictable portion of the sensory message. In other words, they have stripped the sensory signals of their redundancy. Barlow has proposed such redundancy reduction as one of the main purposes of inhibition in both the retina and neocortex (Barlow 1961, 1989, 1992). Theories based on this idea have explained both single cell and psychophysical data, with the analyses especially successful for the retina (Srinivasan et al. 1982; Field 1987, 1989; Linsker 1988, 1989; Atick and Redlich 1990, 1992; Atick et al. 1992, 1993). Barlow's thesis is that by identifying redundancy between different afferent signals, the brain discovers their statistical relationships, and this statistical knowledge is critical for object recognition and associative learning. More specifically
he proposes that a representation where the individual signals, for example, neural outputs, are statistically independent (a factorial code) is the most efficient one for storing statistical knowledge. Although learning a factorial code can be very difficult, it can be accomplished through a step-by-step reduction in redundancy (Barlow and Foldiak 1989; Atick and Redlich 1991; Redlich 1992). At successive stages the code becomes more factorial, and in this way statistical knowledge is acquired. Here I shall refer to this as factorial learning. It can be an enormously effective nonparametric strategy for measuring the statistics of an ensemble (Redlich 1992). There remains, in my opinion, one major conceptual gap in the arguments in favor of factorial coding and learning in the brain: How can this type of processing actually help solve difficult cognitive problems such as pattern recognition? In this paper, I explore one possible answer to this question, by demonstrating that factorial learning can incorporate supervision. In principle this amounts to using factorial learning to find the probabilities P(s,u) of both an unsupervised signal u, such as the retinal image, plus an explicit supervising signal s. However, in practice the way supervision is implemented makes a big difference in the learning efficiency. Here, two very different implementations are explored, each with different advantages. The first is most similar to traditional supervised learning. The second produces a novel distributed representation of P(s,u) by using the redundancy in u to suppress the network output most when u corresponds to the desired concept s. Also, this network learns using only positive examples of s. In this paper I also introduce a new approach to factorial learning using almost reversible cellular automata (ARCA). This makes possible completely nonlinear coding, as needed for example to solve binary problems.
It has been the lack of such nonlinear techniques that has been one of the greatest obstacles to applying factorial coding to a theory of the neocortex, as was done for the retina in Atick and Redlich (1990, 1992). One of the nicest features of ARCA networks is their similarity to biological lateral inhibition circuits, as seen, for example, in the retina, olfactory bulb, and visual cortex (Shepherd 1988, 1990). They thus provide one explanation for the purpose of this inhibition, since in ARCA networks some neurons attempt to predict and then shut down other neurons in order to remove redundancy. Moreover, factorial learning using ARCA networks is a concrete example where restricting the functional connectivity of the network allows the use of very simple neural learning algorithms. Beyond the potential biological applications, supervised factorial learning is a genuine alternative (or complement) to backpropagation for practical applications. The primary advantage is that it comes with a measure for the performance of intermediate stages. No reference is made to the final-stage output, and there is no need to be stuck with a fixed network connectivity. In this sense the approach here is most similar to
Cascade-Correlation (Fahlman and Lebiere 1990), as will be discussed in Section 4. A further advantage is that supervised factorial coding is a broad theoretical framework that encompasses both of the ARCA implementations described here as well as many others, including the linear algorithms in Atick and Redlich (1991) and the nonlinear bound-state methods in Redlich (1992). The paper is organized as follows: in Section 2 the general theory of factorial learning, both supervised and unsupervised, is outlined, while in Section 3 ARCA networks are described and neural learning algorithms derived. Section 4 then introduces supervision; it is broken into two subsections to discuss the two different implementations mentioned above. To be as concrete as possible, and also to test whether supervised factorial learning is a viable alternative to backpropagation, I revisit in both parts of Section 4 some classical backpropagation problems (Rumelhart et al. 1986) and solve them. Some issues of generalization are also discussed.
2 Factorial Learning
The goal of factorial learning is to find successively better approximations to the joint probabilities P(u_1, u_2, ...), to be denoted P(u), of a set of signals u_i. The difficulty is that the number of joint probabilities can be enormous: if there are n signals u_i, each with N gray levels, the number is N^n. Instead of directly attempting to measure these N^n probabilities, one would like to determine P(u) from a much smaller set of measurements, for example by measuring only the n probabilities P_i(u_i). This can be achieved by approximating the joint probabilities P(u) by the product of individual probabilities P_i(u_i):

    P(u) ≈ ∏_i P_i(u_i)    (2.1)
that is, by assuming statistical independence. Of course, for the original input u this factorial assumption is likely to be wildly wrong. The goal of factorial learning, however, is to map the signals u to a new representation u', which is statistically independent. This is accomplished in stages, as shown in Figure 1, so that ultimately one finds an encoding for which equation 2.1 is a good approximation. But what is meant by a better approximation to a factorial code? To quantify the improvements at each stage, it is necessary to define a measure of statistical independence for each representation of the signal u. This learning measure should also be a function only of the local probabilities P_i(u_i). A measure that does this is the sum of the individual
entropies¹:

    E = Σ_i H_i = -Σ_i Σ_{u_i} P_i(u_i) log P_i(u_i)    (2.2)

where the second sum is over all possible gray-scale values of u_i. Given one additional constraint, this measure is minimal only when the code is factorial, so reducing E in stages improves the approximation 2.1. Of course one cannot get something for nothing: one cannot learn the global P(u) using a function E of only the local P_i(u_i). This is why it is also necessary to impose the global constraint H[P(u)] = H'[P'(u')] of no total information loss from u to u', where H[P(u)] is the total entropy. With this constraint, minimizing E produces a factorial code due to the well-known theorem Σ_i H_i ≥ H, with equality only when the code is factorial. But doesn't this require knowledge of P(u), just what we do not know? Absolutely not, since this constraint can be replaced by imposing reversibility on the map u → u':

    the inverse map u_i = u_i(u') exists    (2.3)

Figure 1: A feedforward example of a multilayer network for factorial learning. Each layer L, L', L'' ... of neurons gives a complete representation, with the outputs u_i, u'_i, u''_i ... of successive stages becoming more statistically independent.
¹Note that minimizing E in equation 2.2 is referred to as redundancy reduction even though it does not, strictly speaking, reduce the total Shannon redundancy (Shannon and Weaver 1949). It does reduce all of the redundancy between individual signal elements, which is what is important here.
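As a numerical illustration (my own sketch, not the paper's code), the measure E = Σ_i H_i of equation 2.2, the theorem Σ_i H_i ≥ H, and the effect of a reversible recoding can all be checked on a toy pair of correlated binary signals; the modular subtraction used below anticipates the ARCA maps of Section 3:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a distribution given by event counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# A correlated pair of binary signals: u_1 copies u_0 90% of the time.
samples = [(0, 0)] * 45 + [(1, 1)] * 45 + [(0, 1)] * 5 + [(1, 0)] * 5

def measures(data):
    """Return (total entropy H of the joint, E = H_0 + H_1 of the marginals)."""
    h_joint = entropy(list(Counter(data).values()))
    e = sum(entropy(list(Counter(u[i] for u in data).values())) for i in (0, 1))
    return h_joint, e

h, e = measures(samples)
assert e >= h                    # the theorem: sum_i H_i >= H

# A reversible recoding u'_0 = u_0, u'_1 = (u_1 - u_0) mod 2:
recoded = [(u0, (u1 - u0) % 2) for u0, u1 in samples]
h2, e2 = measures(recoded)
assert abs(h2 - h) < 1e-9        # reversibility leaves H unchanged
assert e2 < e                    # but E drops: the code is more factorial
```

In this toy case the recoding leaves the joint entropy H untouched while driving E all the way down to H, so the recoded pair is exactly factorial.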
Figure 2: An example of a one-dimensional ARCA network where inputs u_i can be inhibited only by inputs to their left, in order to ensure reversibility: equations 3.1 with p = 3. The X neurons implement the modular subtraction u'_i = u_i - f_i(u), while the f_i interneurons attempt to predict and shut down u_i in order to reduce redundancy.

Reversibility is sufficient, though not always necessary, to ensure no information loss. I now introduce a class of maps for which reversibility is ensured by restricting the functional connectivity of a "neural" network.

3 Reversible Coding: ARCA
It is simplest to describe an "almost" reversible cellular automata (ARCA) network using a one-dimensional example. [These automata were called almost reversible in Eriksson et al. (1987) to distinguish them from the usual reversible automata (Margolus 1984); however, in this paper they produce fully reversible maps, and I apologize for any confusion this terminology may cause.] Suppose there are n signals u_i, with i = 0, ..., n - 1, which are to be mapped reversibly to a new set of n signals u'_i, as shown in Figure 2. Also, assume that the signals have N gray levels. Then one possible ARCA rule/network, as shown in the figure for p = 3, is
    u'_i = u_i - f_i(u_{i-p+1}, ..., u_{i-1})  mod N,  for i ≥ p
    u'_i = u_i - f_i(u_0, ..., u_{i-1})        mod N,  for i < p    (3.1)
where 0 < p < n. For continuous signals this rule is replaced by ordinary subtraction. In Figure 2, the "X neurons" perform the modular arithmetic,
which in the binary case is XOR, hence the X. They are shown as distinct units in the figure, but could be combined into a single unit with the f_i "interneurons." It is not difficult to convince oneself that such a rule is completely reversible² for arbitrary functions f_i, assuming the f_i are quantized to have N gray levels. This follows because modular subtraction is reversible, and to compute any given u_i from the u' one need only start from i = 0 and work deductively up to i. Also, it is not difficult to generalize this rule, for example by working from i = n downward. Furthermore, ARCA networks need not be asymmetrical. For example, one can take u'_i = u_i - f_i(u_j) for i even and j odd, and u'_i = u_i for i odd. All that is needed for reversibility is that the f_i(u) for any given i depend only on those u_j that can be calculated first, without knowing u_i. ARCA networks can also be cascaded to compute larger classes of reversible functions. For linear coding of continuous signals the rule in equation 3.1 becomes simply u'_i = Σ_j W_ij u_j, with W a matrix of the form W_ii = 1 and W_ij = 0 for j > i. Such a triangular matrix automatically has det(W) = 1, so its inverse is guaranteed to exist. Also, cascading different triangular matrices can produce arbitrary linear transformations with determinant one.

3.1 Similarities to Local Biological Circuits. Comparing the ARCA network in Figure 2 to microcircuits in the brain, one finds some similarities. First, in most brain circuits there are neurons like the X neurons in the ARCA network, which provide straight-through transmission of afferent signals (Shepherd 1988, 1990). These are, for example, the bipolar cells in the retina, the mitral cells in the olfactory bulb, the relay cells in the LGN, and the pyramidal neurons in olfactory, hippocampal, and neocortex.
In the ARCA network the purpose of the X cells is to preserve information flow, and it seems likely that this is also partly the role of "straight-through" neurons in the brain, as is perhaps clearest for the bipolar cells. Second, the f interneurons in the ARCA network provide another major type of interaction seen in basic brain circuits: inhibitory horizontal interactions. In Figure 2 these are of the feedforward type, like the horizontal cells in the retina and the periglomerular cells in the olfactory bulb; such inhibition is also seen in the cortex, and there is evidence for it in the LGN (Shepherd 1990). The nonbiological aspect of the network in Figure 2 is its asymmetry, but as mentioned above, there are also symmetrical ARCA networks, though they require skipping some inputs. In any case, I do not argue that the brain's microcircuits are precisely ARCA networks, but that functionally there are significant similarities.

²The original ARCA rule in Eriksson et al. (1987) includes only the first line in equation 3.1, so in an infinite string there are always p unknown signals, making the code almost reversible. However, in the infinite system there is no information loss because the missing p signals carry negligible information.
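The rule of equation 3.1 and its reversibility can be sketched directly (a minimal illustration under my own choice of predictor f, not the author's code):

```python
# N gray levels; f is an arbitrary predictor quantized to N levels.
N = 4

def f(window):
    # Toy predictor (hypothetical choice): any function of the window works.
    return sum(window) % N

def arca_encode(u, p):
    """Equation 3.1: u'_i = (u_i - f(u_{i-p+1}, ..., u_{i-1})) mod N."""
    up = []
    for i, ui in enumerate(u):
        window = u[max(0, i - p + 1):i]   # only inputs to the left of u_i
        up.append((ui - f(window)) % N)   # modular subtraction: the X neuron
    return up

def arca_decode(up, p):
    """Recover u from u' by working deductively upward from i = 0."""
    u = []
    for i, ui_p in enumerate(up):
        window = u[max(0, i - p + 1):i]   # these signals are already decoded
        u.append((ui_p + f(window)) % N)
    return u

u = [3, 1, 0, 2, 2, 1, 3, 0]
coded = arca_encode(u, p=3)
assert arca_decode(coded, p=3) == u       # reversibility: no information loss
```

Decoding works for any f precisely because each prediction depends only on signals that have already been recovered, which is the connectivity restriction the text describes.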
3.2 Factorial Learning Algorithms for ARCA Networks. In applying an ARCA rule/network such as equation 3.1 (Fig. 2) to factorial learning, the functions f_i(u) are learned by minimizing E in equation 2.2. In principle this involves measuring the probabilities P_i(u_i) (but not P(u)) after each change in f_i(u). However, in practice it is usually sufficient to replace E by a function such as Σ_i (u_i - f_i)², which requires no probability measurements. For the binary case and for the case of continuous gaussian signals this is guaranteed to lower the H_i [see below for the binary case and Atick and Redlich (1990) for the gaussian]. For other types of signals it should also lower E, but I have not yet attempted a proof. I should emphasize that such simplification may not lead to lowering E optimally at each stage, but with many stages this is not necessary. To see why minimizing Σ_i (u_i - f_i)² can be sufficient, consider the very simple problem of factorizing a pair of signals u_0 and u_1 as depicted in Figure 2, ignoring the neurons with i > 1. Here u'_0 = u_0 and u'_1 = u_1 - f_1(u_0), and furthermore I assume the two signals are binary. In this case, since u'_0 = u_0 we can take E' = -P' log(P') - (1 - P') log(1 - P'), where P' is the probability that u'_1 = 1. Also, because the signals are binary, P' = ⟨u'_1⟩ = ⟨(u'_1)²⟩ = ⟨(u_1 - f_1(u_0))²⟩, so reducing the square difference between u_1 and its prediction f_1(u_0) reduces P' and thus E'. This is true even if P' is initially greater than 1/2, since then, by minimizing P', a function of the complement 1 - u_1 is found that has P' < 1/2. Going back to the many-neuron problem, it is straightforward to find an explicit learning algorithm for the binary case (as well as other cases). Take the functions f_i(u) to be sigmoids σ of the linear sum of inputs u_j: f_i(u) = σ(Σ_j w_ij u_j - t_i). Then using gradient descent to minimize Σ_i (u_i - f_i)² gives the update
    δw_ij = η (u_i - f_i)(1 - f_i) f_i u_j    (3.2)

for the synapse w_ij, and similarly for the threshold t_i. Here η is the usual gradient-descent parameter; I have found learning to be insensitive to its value. Also, because this is basically a one-layer algorithm, it tends to converge very rapidly to both a lower sum of squares and a lower E, equation 2.2. I have never found it to increase E. Note also that in the linear case where u'_i = Σ_j W_ij u_j, minimizing the sum of squares Σ_i (u'_i)² through gradient descent gives
    δW_ij = -η u'_i u_j    (3.3)
which is a simple anti-Hebb rule for the output u'_i. This rule is simpler than the linear rule in Atick and Redlich (1991) because there the reversibility constraint was added explicitly in E.

4 Supervised Learning
In supervised learning there are two signals, the unsupervised "image" u_i plus a set of supervisor signals s_j. The s_j may be coincident with
or occur before or after the signals u_i: the index i may include both space and time. The goal is to factorially learn P(s,u), from which can be calculated P(s|u) = P(s,u)/P(u), the probability that u belongs to the concept s. For this it may be useful for the early stages of sensory processing to have already learned the background statistics P(u). In applying factorial learning to the supervised case, however, there is a major issue of implementation to be resolved. The problem is that during supervised learning the input consists of pairs u, s, but after learning only the signal u is given, since the aim is to predict s. So if supervised factorial learning were implemented naively, one would need to cycle through all possible s serially to find the one with the largest P(s|u). Though the brain does perform serial searches, it also seems capable of solving many problems in parallel. I therefore give below two approaches to supervised factorial learning that avoid the need for serial processing.

Figure 3: Schematic of ARCA networks for supervised learning of the more "traditional" type. Here only the redundancy between u and s is reduced, with the f_i interneurons functions of u alone and the X neurons performing s'_i = s_i - f_i(u). At successive stages only that portion of the s signal which could not be predicted is passed forward, simplifying the learning at the next stage, until the problem is solved.

4.1 "Traditional" Oracle-Type Approach. One way to avoid serial processing is to exploit the fact that an ARCA network factorizes by using one set of signals to predict others, with one set in effect acting as the "oracle" for the others. To apply this to the supervised case, one uses u as the predicting and s as the predicted signals. This is shown schematically in Figure 3, where the functions f_i(u) attempt to predict the
signals s_i. The outputs s'_i of the XOR neurons are the unpredicted error signals, which at the next learning stage the f'_i attempt to predict. Since the error signals at successive stages contain less and less information, they become easier to predict. If at first, for example, the problem is not half-plane separable, it becomes more and more so, until at the final stage it can be learned by a perceptron. Cycling through possible s is not required in Figure 3, since given a u the network has a direct prediction of the signals s_i as the modular sum of the f_i from all stages. This is because when learning is complete, the final-stage outputs remain off for all pairs u, s, so

    s_i = (f_i + f'_i + f''_i + ···)  mod N    (4.1)

Therefore, if instead of equation 4.1 the unknown inputs s_i are artificially set to zero, then the final-stage outputs will equal (minus) the predicted s_i.

4.1.1 Some Classical Backpropagation Problems Revisited. For illustration, I now revisit some classical backpropagation problems (Rumelhart et al. 1986). For every one of these, supervised factorial learning finds a solution without false-minima difficulties, although as in all gradient descent problems η must not be too large. Also, the solutions here use fewer or the same number of learning interneurons f_i as the backpropagation solutions. The XOR/Parity problem is the usual paradigm for multilayer learning, since even XOR is not perceptron learnable. Here, this is perhaps not the best example, since ARCA networks already include some XOR neurons. However, the learning elements in the network are the f_i interneurons, for which learning XOR/Parity remains nontrivial. More importantly, XOR is the simplest example that illustrates how factorial learning in distinct stages differs from backpropagation: An ARCA network for XOR is shown in Figure 4a. In the first learning step the f(u_0, u_1) neuron attempts to predict s, but cannot do so completely because f(u) is a perceptron.
Instead, it shuts down s as much as possible, leaving for example s' = 1 for u = (1,0) only, rather than also for (0,1). The problem at the next stage is then much simpler, since s' can be predicted by a perceptron. Although there are separate XOR/modular neurons in Figure 4a for each stage, they can be combined into a single "parity" neuron, as in Figure 4b. This figure shows a network that solves both the Parity problem and the Symmetry problem (s = 1 iff u_0, u_1, u_2 = u_5, u_4, u_3) by learning the functions f, f', f'', ... in stages, as was done for XOR in Figure 4a. These solutions use the same number of learning f neurons as used by backpropagation in Rumelhart et al. (1986). Another problem (see Fig. 4c) that was solved using the same number of neurons as backpropagation was the Negation problem (s_0, s_1, s_2 equal the complements of u_0, u_1, u_2 when u_3 = 1, and simply equal u_0, u_1, u_2 otherwise). Here all the f_i were learned in one step.
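The staged decomposition of XOR described above can be made concrete with hand-chosen perceptron stages (a toy sketch; the paper learns these stages with the rule of equation 3.2, whereas here they are fixed by hand):

```python
def step(x):
    return 1 if x > 0 else 0

def f1(u):
    # Stage-1 perceptron: fires on u = (0, 1) only (hand-chosen weights).
    return step(u[1] - u[0])

def f2(u):
    # Stage-2 perceptron: fires on u = (1, 0), the one remaining error case.
    return step(u[0] - u[1])

for u0 in (0, 1):
    for u1 in (0, 1):
        u, s = (u0, u1), u0 ^ u1
        s_err1 = (s - f1(u)) % 2       # stage-1 error: on for (1, 0) only
        s_err2 = (s_err1 - f2(u)) % 2  # stage-2 error: off for every input
        assert s_err2 == 0
        # With the supervisor input set to zero, the network outputs (minus)
        # the modular sum of the stage predictions: the predicted s.
        assert (f1(u) + f2(u)) % 2 == s
```

Each stage is individually perceptron-learnable, and the modular sum of the stages reproduces XOR, which is the content of equation 4.1 in this binary case.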
Figure 4: ARCA networks used to solve some classical backpropagation problems: XOR in a, Parity and Symmetry in b, Negation in c, and Addition in d. See text for details.
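As a quick check of the premise behind the Addition solution of Figure 4d (my own enumeration, not the paper's code), the sum bits are indeed statistically dependent:

```python
from collections import Counter
from itertools import product

def bits(x, n):
    """n-bit binary expansion, most significant bit first."""
    return tuple((x >> i) & 1 for i in reversed(range(n)))

# All 16 sums of two two-bit numbers, as three-bit outputs (s_0, s_1, s_2).
sums = [bits(a + b, 3) for a, b in product(range(4), repeat=2)]
n = len(sums)

p_s0 = Counter(s[0] for s in sums)            # marginal of s_0
p_s1 = Counter(s[1] for s in sums)            # marginal of s_1
p_joint = Counter((s[0], s[1]) for s in sums)

factored = (p_s0[1] / n) * (p_s1[1] / n)      # what independence would predict
joint = p_joint[(1, 1)] / n                   # what actually happens
assert joint != factored                      # so knowing s_0 helps predict s_1
```

Because the 16 input pairs do not produce the seven possible sums with equal frequency, the joint distribution of the sum bits does not factor into its marginals, and this residual redundancy is what the cascaded f_0, f_1, f_2 can exploit.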
A more interesting result was discovered for the Addition problem, since redundancy between the s_i could be exploited to find a more compact solution than in Rumelhart et al. (1986). In this problem the two-bit binary number u_0, u_1 is added to u_2, u_3, giving the output s_0, s_1, s_2. However, since the sums do not occur with equal frequency, the s_i are not statistically independent. Therefore, for example, knowing s_0 can improve the prediction of s_1. One can exploit this by first learning f_0(u) (which predicts s_0) and then using it with u to learn f_1(u, f_0), and then using them both to learn f_2(u, f_0, f_1). This is illustrated in Figure 4d, which turns out to be a three-neuron solution, in contrast to the five-neuron solution in Rumelhart et al. (1986). I have also solved the classical T,C problem using the approach of this section, although it and the Symmetry problem above are far more elegantly solved in Section 4.2. The cost of using the "traditional" approach here is the need for many learning stages. For example, the T,C problem on a 10 × 10 grid required 37 stages, although this represents only 37 neurons, fewer than needed for the backpropagation solution. Also, the learning here was accomplished without assuming translation invariance, as was done in Rumelhart et al. (1986).

4.1.2 Comparison with Cascade-Correlation. As mentioned in the introduction, the cascade-correlation approach of Fahlman and Lebiere (1990) is similar to the approach just described. The main points of similarity are that both (1) build the network architecture as needed, using relatively simple one-layer learning algorithms at each step, and (2) use the output error signals at each step. However, the way in which neurons are added to the network differs, since in cascade-correlation both the "f" and "X" neurons have adaptable links, which are learned in a two-step procedure. This is different from the simpler one-step procedure here, using only modifiable "f" synapses. The more significant difference between the two approaches is that the algorithm described above is but one of many possible implementations of supervised factorial learning. This is evident from the implementation in the following section, which bears no resemblance to cascade-correlation.
Moreover, factorial coding can be learned using algorithms such as the "bound state" technique in Redlich (1992), which are very different from the ARCA ones used here. Also, the end product of supervised (and unsupervised) factorial learning offers the additional benefit of actually learning an approximation to the probabilities P(u) and P(u,s). In terms of performance, only the Parity problem has been tried using both cascade-correlation and the above algorithm, and I found that the performance of the latter is of the same order as the 325 epochs for 8-bit Parity, with both approaches showing an improvement over backpropagation. However, it is in the implementation below that I find supervised factorial learning shows the greatest improvement over backpropagation (e.g., a reduction from 75,000 to 50 learning steps for the Symmetry problem).

4.2 Positive Example Learning Approach. Just as the Addition problem illustrates the usefulness of exploiting redundancy within s, there are many problems where redundancy in u can be exploited. Such problems
include the Symmetry and T,C problems mentioned above, as well as the Shifter problem in Hinton and Sejnowski (1986), and any one- or two-dimensional translation problem; to save space, only the Symmetry and T,C solutions are given. These all use a representation u that is highly redundant. They are also closest to the type of problems the brain needs to solve. One way to fully exploit the redundancy in u, while avoiding the need for serial processing, is to build separate ARCA networks N* for each possible s*, with each network in parallel calculating its P(u|s*). After this fully parallel processing, the outputs of the different N* can be compared to see which has the largest P(s*|u) for a given u. In practice (see below) the comparison can be as simple as seeing which of the N* responds least to the input u. The learning process can be further simplified by calculating P(u|s*) = P(u,s*)/P(s*) in place of P(u,s*). The P(s*) can be calculated separately; in the examples here it can be completely ignored, since for them P(s*) is flat. The beauty of having the N* calculate P(u|s*) is that then the learning set consists only of positive examples of each concept s* (as in "maximum likelihood" learning, in contrast to the discriminative learning of Section 4.1). Each N* therefore learns the statistical structure of members of its s*, and in this way learns what makes s* unique.

4.2.1 The Symmetry Problem. To illustrate this type of learning, I return to the Symmetry problem defined in Section 4.1.1. Since P(s*=1|u) = 1 - P(s*=0|u), only a single network that calculates P(u|s*=1) is needed, as shown in Figure 5. Here the learning set consists of only the 2³ = 8 positive examples of symmetric inputs; the 64 - 8 negative examples are ignored. As a result the f_i in Figure 5 learn to factorize P(u|s*) in fewer than 50 total steps, as opposed to the 75,000 used in Rumelhart et al. (1986).
Following this learning, the ARCA solution is u'_0 = u'_1 = u'_2 = 0 for all symmetric patterns, while u'_i = u_i for i > 2. This is because knowing that a pattern is symmetric allows the network to predict and thus shut down u_0, u_1, u_2 based on u_3, u_4, u_5. On the other hand, since the network always predicts symmetry, it makes "false" predictions when shown nonsymmetric patterns. Then at least one of u'_0, u'_1, u'_2 will be turned on, with more on as the pattern deviates more from symmetry. So if any of the u'_i = 1 for i ≤ 2 we know the pattern must be nonsymmetric.

4.2.2 Generalization and Multiple Networks. In the Symmetry problem the network in Figure 5 correctly identifies as nonsymmetric all 56 patterns it has never seen. This is because the Symmetry problem is somewhat special: for it the learning set contains all possible values of the subset of inputs u_3, u_4, u_5 used to predict u_0, u_1, u_2. Therefore, even for nonsymmetric patterns the network makes unambiguous predictions for
u_0, u_1, u_2. In this sense the learning set is complete with respect to this particular ARCA network. Problems that are complete in this sense are not difficult to find: they include the Shifter problem of Hinton and Sejnowski (1986). On the other hand, when the learning set (even if it includes all members of s*) is not complete for a specific ARCA network, then some nonmembers of the concept s* may be falsely identified as members of s*: false-positive errors. However, such false-positive errors are rare. For example, in the Symmetry problem, to falsely identify a nonsymmetric input as symmetric the network would need to "correctly" predict and shut down all three of u_0, u_1, u_2. One way to reduce false-positive errors, which works extremely well, is to use more than one ARCA network for each concept and then average the P(u|s*). The different ARCA networks, using the same learning set, rarely make the same generalization errors, because each seeks out different sources of redundancy in u.

Figure 5: ARCA network for positive-example learning of the Symmetry problem. Knowing that patterns are symmetric allows the network to use the values of u_3, u_4, u_5 to predict and completely shut off u_0, u_1, u_2 for symmetric patterns. For nonsymmetric patterns, which the network has never seen, it correctly generalizes by turning on at least one bit of u_0, u_1, u_2.

4.2.3 The T,C Problem. A relatively simple example where false-positive errors can occur is the classical T,C problem in Rumelhart et al. (1986) (T and C can occur at any location and at one of four rotations). This will
be discussed at the end of this section, after I first explain how factorial learning is applied in this case. For two-dimensional problems there exist a greater variety of ARCA networks, some of which are shown in Figure 6. In Figure 6a the two-dimensional array at each learning stage is divided into "even" and "odd" subarrays: a checkerboard. At each learning stage either the odd inputs are predicted by their even neighbors or vice versa. Another option is Figure 6b, where all inputs in the lower left quadrant are used for the prediction; any other quadrant can also be used. Also, all inputs to one side of a half plane drawn through the predicted input may be used, as in Figure 6c. In every case these arrangements ensure reversibility.

Figure 6: Examples of some two-dimensional ARCA networks. The cross-hatched inputs are used to predict the black input in a way that guarantees reversibility.

For the T,C problem only a P(u|T) network is needed, since what is not a T must be a C. The learning set then consists only of positive examples of the letter T. Initially the letter T has five bits on, as shown in Figure 7a. However, some of these bits are redundant since, knowing that the letter is T, no more than three bits are needed to indicate position and orientation. For example, by learning in stages (it took 5 to 6 stages) using checkerboard networks (Fig. 6a), the number of bits on was reduced to three, as shown for the T in Figure 7e, with corresponding results for the other three rotations. [Fewer bits on indicates higher P(u|T) because P(u|T) is given by the product of individual bit probabilities, and P(bit on) << P(bit off).] Actually this T network achieved more than the ability to distinguish T from C. It also discovered what makes a T different from all other concepts. Thus when this T network is shown examples of other 5-bit letters such as L and X in addition to C (Fig. 7b-d), it also turns on more bits for them than for a T (Fig. 7f-h). One may also build separate networks for the other letters, and then
764
A. Norman Redlich
Figure 7 The inputs (a)-(d) and outputs (e)-(h) of an ARCA network that was used to learn P(u|T) by being shown only Ts at all positions and at four rotations (only vertical shown) on a 10 x 10 grid. All letters have five bits on at their inputs (a)-(d). For all rotations the redundancy-reduced output for a T has only three bits on (e). However, the network correctly indicates that other letters are not T by maintaining or increasing their number of bits on (f)-(h).

for a given u, look for the network with the smallest number of output bits on. I did find that sometimes these networks made false-positive errors. For example, the L network did falsely mistake a T for an L at one orientation. However, this type of error was rare, and two different P(u|L) networks (or any others) never made the same errors, so averaging the outputs of any two different ARCA networks always solved the problem. Also, by using additional ARCA networks for each letter, the number of learning stages can be reduced. Instead of the five or six layers used above, a single layer using two ARCA networks, one odd and one even checkerboard (Fig. 6a), is sufficient to distinguish T from C, X, and L. This works because where the odd network errs the even network more than compensates, and vice versa. I have found that combining ARCA networks in this way always sharpens the distinction between one concept and others.
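The reversibility that the Figure 6 arrangements guarantee can be demonstrated with a toy transform (my own sketch, not the trained ARCA predictor; the majority-vote prediction rule is an invented stand-in for the learned network): odd checkerboard cells are replaced by their residual against a fixed prediction from their even neighbors, and the map inverts exactly because the even cells it conditions on are left untouched.

```python
# Toy illustration of the Figure 6a checkerboard idea (hypothetical
# predictor, not the paper's trained network): odd cells are replaced
# by residuals (cell XOR prediction); even cells are untouched, so the
# same prediction is available on decode and the transform is reversible.

def neighbors(grid, i, j):
    n = len(grid)
    return [grid[x][y] for x, y in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
            if 0 <= x < n and 0 <= y < n]

def predict(grid, i, j):
    # Fixed stand-in for the learned predictor: majority of the (even)
    # 4-neighbors, ties broken toward 0.
    nb = neighbors(grid, i, j)
    return 1 if sum(nb) * 2 > len(nb) else 0

def forward(grid):
    n = len(grid)
    out = [row[:] for row in grid]
    for i in range(n):
        for j in range(n):
            if (i + j) % 2 == 1:                 # odd (predicted) cells
                out[i][j] = grid[i][j] ^ predict(grid, i, j)
    return out

def backward(coded):
    n = len(coded)
    out = [row[:] for row in coded]
    for i in range(n):
        for j in range(n):
            if (i + j) % 2 == 1:
                # Even cells are intact, so the prediction is identical.
                out[i][j] = coded[i][j] ^ predict(coded, i, j)
    return out

grid = [[1, 0, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
assert backward(forward(grid)) == grid           # exactly reversible
```

A good predictor drives most residuals to 0, reducing the number of bits on, which is precisely the redundancy-reduction criterion used above.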
Acknowledgments

I wish to thank J. Atick and Z. Li for reading the manuscript and providing many helpful insights and suggestions. Also, this work was supported in part by a grant from the Seaver Institution.
Supervised Factorial Learning
765
References
Atick, J. J., and Redlich, A. N. 1990. Towards a theory of early visual processing. Neural Comp. 2, 308-320.
Atick, J. J., and Redlich, A. N. 1992. What does the retina know about natural scenes? Neural Comp. 4, 196-210.
Atick, J. J., and Redlich, A. N. 1993. Convergent algorithm for sensory receptive field development. Neural Comp. 5, 45-60.
Atick, J. J., Li, Z., and Redlich, A. N. 1992. Understanding retinal color coding from first principles. Neural Comp. 4, 559-572.
Atick, J. J., Li, Z., and Redlich, A. N. 1993. What does post-adaptation color appearance reveal about cortical color representation? Vision Res. 33(1), 123-129.
Barlow, H. B. 1961. Possible principles underlying the transformation of sensory messages. In Sensory Communication, W. A. Rosenblith, ed. MIT Press, Cambridge, MA.
Barlow, H. B. 1989. Unsupervised learning. Neural Comp. 1, 295-311.
Barlow, H. B. 1992. The biological role of the neocortex. In Proceedings of the Brain Theory Meeting, Ringberg, April 1990. Springer, Berlin.
Barlow, H. B., and Foldiak, P. 1989. Adaptation and decorrelation in the cortex. In The Computing Neuron, R. Durbin, C. Miall, and G. Mitchison, eds. Addison-Wesley, Wokingham, England.
Eriksson, K., Lindgren, K., and Mansson, B. A. 1987. Structure, Context, Complexity, Organization, Chapter 4. World Scientific, Singapore.
Fahlman, S. E., and Lebiere, C. 1990. The cascade-correlation learning algorithm. In Advances in Neural Information Processing Systems 2, D. S. Touretzky, ed. Morgan Kaufmann, San Mateo, CA.
Field, D. J. 1987. Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A 4, 2379-2394.
Field, D. J. 1989. What the statistics of natural images tell us about visual coding. In Human Vision, Visual Processing, and Digital Display. SPIE Vol. 1077, pp. 269-276.
Hinton, G. E., and Sejnowski, T. J. 1986. Learning and relearning in Boltzmann machines. In Parallel Distributed Processing, D. E. Rumelhart and J. L.
McClelland, eds., Vol. 1, pp. 282-317. MIT Press, Cambridge, MA.
Linsker, R. 1988. Self-organization in a perceptual network. Computer 21, 105-117.
Linsker, R. 1989. An application of the principle of maximum information preservation to linear systems. In Advances in Neural Information Processing Systems, D. S. Touretzky, ed., Vol. 1, pp. 186-194. Morgan Kaufmann, San Mateo, CA.
Margolus, N. 1984. Physics-like models of computation. Physica 10D, 81-95.
Redlich, A. N. 1993. Redundancy reduction as a strategy for unsupervised learning. Neural Comp. 5, 289-304.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing,
D. E. Rumelhart and J. L. McClelland, eds., Vol. 1, pp. 318-362. MIT Press, Cambridge, MA.
Shannon, C. E., and Weaver, W. 1949. The Mathematical Theory of Communication. The University of Illinois Press, Urbana.
Shepherd, G. M. 1988. Neurobiology. Oxford University Press, New York.
Shepherd, G. M. 1990. The Synaptic Organization of the Brain. Oxford University Press, New York.
Srinivisan, M. V., Laughlin, S. B., and Dubs, A. 1982. Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. London Ser. B 216, 427-459.

Received 18 September 1992; accepted 15 January 1993.
Communicated by A. B. Bonds
On Learning Perceptrons with Binary Weights

Mostefa Golea
Mario Marchand
Ottawa-Carleton Institute for Physics, University of Ottawa, Ottawa, Ontario, Canada K1N 6N5

We present an algorithm that PAC learns any perceptron with binary weights and arbitrary threshold under the family of product distributions. The sample complexity of this algorithm is O[(n/ε)^4 ln(n/δ)] and its running time increases only linearly with the number of training examples. The algorithm does not try to find an hypothesis that agrees with all of the training examples; rather, it constructs a binary perceptron based on various probabilistic estimates obtained from the training examples. We show that, in the restricted case of the uniform distribution and zero threshold, the algorithm reduces to the well-known clipped Hebb rule. We calculate exactly the average generalization rate (i.e., the learning curve) of the algorithm, under the uniform distribution, in the limit of an infinite number of dimensions. We find that the error rate decreases exponentially as a function of the number of training examples. Hence, the average-case analysis gives a sample complexity of O[n ln(1/ε)], a large improvement over the PAC learning analysis. The analytical expression of the learning curve is in excellent agreement with extensive numerical simulations. In addition, the algorithm is very robust with respect to classification noise.

1 Introduction

The study of neural networks with binary weights is well motivated from both the theoretical and practical points of view. Although the number of possible states in the weight space of a binary network is finite, the capacity of the network is not much inferior to that of its continuous counterpart (Barkai and Kanter 1991). Likewise, the hardware realization of binary networks may prove simpler. Although networks with binary weights have been the subject of intense analysis from the capacity point of view (Barkai and Kanter 1991; Kohler et al.
1990; Krauth and Mézard 1989; Venkatesh 1991), the question of the learnability of these networks remains largely unanswered. The reason for this state of affairs lies perhaps in the apparent strength of the following distribution-free result (Pitt and Valiant 1988): learning perceptrons with binary weights is equivalent to Integer Programming and so,
Neural Computation 5, 767-782 (1993) © 1993 Massachusetts Institute of Technology
768
Mostefa Golea and Mario Marchand
it is an NP-Complete problem. However, this result does not rule out the possibility that this class of functions is learnable under some reasonable distributions. In this paper, we take a close look at this possibility. In particular, we investigate, within the PAC model (Valiant 1984; Blumer et al. 1989), the learnability of single perceptrons with binary weights and arbitrary threshold under the family of product distributions. A distribution of examples is a product distribution if the setting of each input variable is independent of the settings of the other variables. The result of this investigation is a polynomial-time algorithm that PAC learns binary perceptrons under any product distribution of examples. More specifically, the sample complexity of the algorithm is O[(n/ε)^4 ln(n/δ)], and its running time is linear in the number of training examples. We note here that the algorithm produces hypotheses that are not necessarily consistent with all the training examples, but that nonetheless have very good generalization ability. These types of algorithms are called "inconsistent algorithms" (Meir and Fontanari 1992). How does this algorithm relate to the learning rules proposed previously for learning binary perceptrons? We show that under the uniform distribution and for binary perceptrons with zero threshold, this algorithm reduces to the clipped Hebb rule (Kohler et al. 1990) [also known as the majority rule (Venkatesh 1991)]. To understand the typical behavior of the algorithm, we calculate exactly, under the uniform distribution, its average generalization rate (i.e., the learning curve) in the limit of an infinite number of input variables. We find that, on average, the generalization rate converges exponentially to 1 as a function of the number of training examples. The sample complexity in the average case is O[n ln(1/ε)], a large improvement over the PAC learning analysis.
We also calculate the average generalization rate when learning from noisy examples and show that the algorithm is very robust with respect to classification noise. The results of extensive simulations are in very good agreement with the theoretical ones.

2 Definitions
Let I denote the set {-1, +1}. A perceptron g on the instance space I^n is specified by a vector of n weight values w_i and a single threshold value θ. For an input vector x = (x_1, x_2, ..., x_n) ∈ I^n, we have

g(x) = +1 if Σ_{i=1}^n w_i x_i > θ, and g(x) = -1 otherwise.   (2.1)

A perceptron is said to be positive if w_i ≥ 0 for i = 1, ..., n. We are interested in the case where the weights are binary valued (±1). We assume, without loss of generality (w.l.o.g.), that θ is an integer and -n - 1 ≤ θ ≤ n.
Learning Perceptrons with Binary Weights
769
An example is an input-output pair <x, g(x)>. A sample is a set of examples. The examples are assumed to be generated randomly according to some unknown probability distribution D that can be any member of the family 𝒟 of all product distributions. Distribution D belongs to 𝒟 if and only if the setting of each input variable x_i is chosen independently of the settings of the other variables. The uniform distribution, where each x_i is set independently to ±1 with probability 1/2, is a member of 𝒟. We denote by P(A) the probability of event A and by P̂(A) its empirical estimate based on a given finite sample. All probabilities are taken with respect to the product distribution D on I^n. We denote by E(x) and Var(x) the expectation and variance of the random variable x. If a, b ∈ {-1, +1}, we denote by P(g = b | x_i = a) the conditional probability that g = b given the fact that x_i = a. The influence of a variable x_i, denoted Inf(x_i), is defined as

Inf(x_i) = P(g = +1 | x_i = +1) - P(g = +1 | x_i = -1)
         - P(g = -1 | x_i = +1) + P(g = -1 | x_i = -1)   (2.2)
Intuitively, the influence of a variable is positive (negative) if its weight is positive (negative).

3 PAC Learning Single Binary Perceptrons
3.1 The Learning Model. In this section, we adopt the PAC learning model (Valiant 1984; Blumer et al. 1989). Here the methodology is to draw, according to D, a sample of a certain size labeled according to an unknown target perceptron g, and then to find a "good" approximation g' of g. The error of the hypothesis perceptron g', with respect to the target g, is defined to be P(g' ≠ g) = P[g'(x) ≠ g(x)], where x is drawn according to the same distribution D used to generate the training sample. An algorithm PAC learns from examples the class G of binary perceptrons, under a family 𝒟 of distributions on I^n, if for every g ∈ G, any D ∈ 𝒟, and any 0 < ε, δ < 1, the algorithm runs in time polynomial in (n, 1/ε, 1/δ) and outputs, with probability at least 1 - δ, an hypothesis g' ∈ G that makes an error of at most ε with g.

3.2 The Learning Algorithm. We assume that the examples are generated according to an (unknown) product distribution D on {-1, +1}^n and labeled according to a target binary perceptron g given by equation 2.1. The learning algorithm proceeds in three steps:

1. Estimating, for each input variable x_i, the probability that it is set to +1. If this probability is too high (too low), the variable is set to +1 (-1). Note that setting a variable to a given value is equivalent to neglecting this variable, because any constant can be absorbed in the threshold.
2. Estimating the weight values (signs). This is done by estimating the influence of each variable. 3. Estimating the threshold value.
To simplify the analysis, we introduce the following notation. Let y be the vector whose components y_i are defined as

y_i = w_i x_i   (3.1)

Then, since w_i^2 = 1, equation 2.1 can be written as

g(x) = +1 if Σ_{i=1}^n y_i > θ, and g(x) = -1 otherwise.   (3.2)

In addition, we define Inf(y_i) by

Inf(y_i) = P(g = +1 | y_i = +1) - P(g = +1 | y_i = -1)
         - P(g = -1 | y_i = +1) + P(g = -1 | y_i = -1)   (3.3)

Note that if D(x) is a product distribution on {-1, +1}^n, then so is D(y).
Lemma 1. Let g be a binary perceptron. Let x_i be a variable in g. Let a ∈ {-1, +1}. Let g' be the perceptron obtained from g by setting x_i to a. Then, if P(x_i = -a) ≤ ε/2n,

P(g ≠ g') ≤ ε/2n

Proof. Follows directly from the fact that P(g ≠ g') ≤ P(x_i = -a). □
Lemma 1 implies that we can neglect any variable x_i for which P(x_i = +1) is too high (too low). In what follows, we consider only variables that have not been neglected. As we said earlier, intuition suggests that the influence of a variable is positive (negative) if its weight is positive (negative). The following lemma strengthens this intuition by showing that there is a measurable gap between the two cases. This gap will be used to estimate the weight values (signs).
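Lemma 2's sign claim below is easy to probe numerically. The following is my own illustration, not code from the paper (the sample size, the range of bit biases, and all names are invented): it estimates each Inf(x_i) for a random binary perceptron under a nonuniform product distribution, using the equivalent form Inf(x_i) = E(g | x_i = +1) - E(g | x_i = -1) of equation 2.2, and checks that the estimated influences carry the signs of the corresponding weights.

```python
import random

def influence_demo(n=9, m=60000, seed=0):
    # Random target perceptron with theta = 0 and a nonuniform product
    # distribution (parameters are illustrative, not from the paper).
    rng = random.Random(seed)
    w = [rng.choice([-1, 1]) for _ in range(n)]       # target weights
    p = [rng.uniform(0.3, 0.7) for _ in range(n)]     # P(x_i = +1)
    sums = [{1: 0.0, -1: 0.0} for _ in range(n)]
    counts = [{1: 0, -1: 0} for _ in range(n)]
    for _ in range(m):
        x = [1 if rng.random() < p[i] else -1 for i in range(n)]
        g = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
        for i in range(n):
            sums[i][x[i]] += g
            counts[i][x[i]] += 1
    # Inf(x_i) = E(g | x_i = +1) - E(g | x_i = -1), a rewriting of eq. 2.2
    inf = [sums[i][1] / counts[i][1] - sums[i][-1] / counts[i][-1]
           for i in range(n)]
    return w, inf

w, inf = influence_demo()
# The sign of each estimated influence matches the sign of the weight.
assert all((v > 0) == (wi > 0) for wi, v in zip(w, inf))
```

With n = 9 and θ = 0 the true influences are of order 0.1-0.5, far above the sampling noise, so the sign test is reliable; this is exactly the gap that Lemma 2 quantifies.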
Lemma 2. Let g be a perceptron such that P(g = +1), P(g = -1) > p, where 0 < p < 1. Then for any product distribution D, there is a gap Δ > 0, depending only on p, n, and θ, such that

Inf(x_i) > Δ if w_i = +1, and Inf(x_i) < -Δ if w_i = -1.
Proof. We first note that from the definition of the influence and equations 3.1 and 3.3 we can write Inf(x_i) = w_i Inf(y_i), so it suffices to bound Inf(y_i) from below. We exploit the independence of the input variables to express the conditional probabilities appearing in Inf(y_i) in terms of p(r) = P(Σ_{j≠i} y_j = r) (equations 3.4-3.6). From the properties of the generating function associated with product distributions, it is well known (Ibragimov 1956; MacDonald 1979) that p(r) is always unimodal and reaches its maximum at a given value of r, say r_max. We distinguish two cases. If θ ≥ r_max, then using equation 3.5 the relevant sum over p(r) is bounded above by (n + θ + 2) p(θ + 1) (equation 3.7), and combining this with equation 3.4 the claimed gap follows. If θ ≤ r_max - 1, the same bound (n + θ + 2) p(θ + 1) is obtained from equation 3.6 (equation 3.8), and again combining with equation 3.4 the gap follows. □
So, if we estimate Inf(x_i) to within a precision better than the gap established in Lemma 2, we can determine the value of w_i with enough confidence. Note that if θ is too large (small), most of the examples will be negative (positive). In this case, the influence of any input variable is very weak. This is the reason we require P(g = +1), P(g = -1) > p. The weight values obtained in the previous step define the weight vector of our hypothesis perceptron g'. The next step is to estimate an appropriate threshold for g', using these weight values. For that, we appeal to the following lemma.

Lemma 3. Let g be a perceptron with a threshold θ. Let g' be the perceptron obtained from g by substituting r for θ. Then, if r ≤ θ,

P(g ≠ g') ≤ 1 - P(g = +1 | g' = +1)

Proof.

P(g ≠ g') ≤ P(g = -1 | g' = +1) + P(g = +1 | g' = -1)
          = 1 - P(g = +1 | g' = +1) + P(g = +1 | g' = -1)
          = 1 - P(g = +1 | g' = +1)

The last equality follows from the fact that P(g = +1 | g' = -1) = 0 for r ≤ θ. □
So, if we estimate P(g = +1 | g' = +1) for r = -n-1, -n, -n+1, ..., and then choose as a threshold for g' the least r for which P̂(g = +1 | g' = +1) ≥ (1 - ε), we are guaranteed to have P(g ≠ g') ≤ ε. Obviously, such an r exists and is always ≤ θ, because P(g = +1 | g' = +1) = 1 for r = θ. A sketch of the algorithm for learning single binary perceptrons is given in Figure 1.
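The threshold search that Lemma 3 justifies can be illustrated on a small instance. The sketch below is my own toy (exact enumeration under the uniform distribution stands in for sampling; the weight vector and names are invented): with the correct weights in hand, it sweeps r upward from -(n+1) and stops at the first r for which P(g = +1 | g' = +1) ≥ 1 - ε.

```python
from itertools import product

def sign(v):
    return 1 if v > 0 else -1

def threshold_search(w, theta, eps):
    # Exact version of the Lemma 3 sweep: enumerate all inputs (uniform
    # distribution), and return the least r with P(g=+1 | g'=+1) >= 1-eps,
    # together with the resulting error P(g != g').
    n = len(w)
    xs = list(product([-1, 1], repeat=n))
    g = {x: sign(sum(wi * xi for wi, xi in zip(w, x)) - theta) for x in xs}
    for r in range(-(n + 1), theta + 1):
        gp = {x: sign(sum(wi * xi for wi, xi in zip(w, x)) - r) for x in xs}
        pos = [x for x in xs if gp[x] == 1]
        if pos and sum(g[x] == 1 for x in pos) / len(pos) >= 1 - eps:
            err = sum(g[x] != gp[x] for x in xs) / len(xs)
            return r, err
    return theta, 0.0               # r = theta always satisfies the test

r, err = threshold_search(w=[1, -1, 1, 1, -1], theta=2, eps=0.1)
assert r <= 2 and err <= 0.1        # Lemma 3: P(g != g') <= eps
```

Note that the sweep may stop below the true threshold (here it does), yet Lemma 3 still guarantees the error bound; the found threshold need only be accurate, not exact.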
Theorem 1. The class of binary perceptrons is PAC learnable under the family of product distributions.
Algorithm LEARN-BINARY-PERCEPTRON(n, ε, δ)

Parameters: n is the number of input variables, ε is the accuracy parameter, and δ is the confidence parameter.

Output: a binary perceptron g' defined by a weight vector (w_1, ..., w_n) and a threshold r.

Description:

1. Call m = [160n(n+1)]^2/ε^4 · ln(32n/δ) examples. This sample will be used to estimate the different probabilities. Initialize g' to the constant perceptron -1.

2. (Are most examples positive?) If P̂(g = +1) ≥ 1 - ε/2, set g' ≡ +1 and return g'.

3. (Are most examples negative?) If P̂(g = +1) ≤ ε/2, set g' ≡ -1 and return g'.

4. Set p = ε/4 (the parameter of Lemma 2).

5. (Is P(x_i = +1) too low or too high?) For each input variable x_i:
   (a) Estimate P̂(x_i = +1).
   (b) If P̂(x_i = +1) ≤ ε/4n or 1 - P̂(x_i = +1) ≤ ε/4n, neglect this variable.

6. (Determine the weight values) For each input variable x_i:
   (a) If Inf̂(x_i) > Δ/2, set w_i = +1.
   (b) Else if Inf̂(x_i) < -Δ/2, set w_i = -1.
   (c) Else set w_i = 0 (x_i is not an influential variable).

7. (Estimating the threshold) Initialize r (the threshold of g') to -(n+1).
   (a) Estimate P̂(g = +1 | g' = +1).
   (b) If P̂(g = +1 | g' = +1) > 1 - ε, go to step 8.
   (c) r = r + 1. Go to step 7a.

8. Return g' (that is, (w_1, ..., w_n; r)).
Figure 1: An algorithm for learning single binary perceptrons on product distributions.
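To make the estimation steps of Figure 1 concrete, here is a simplified implementation sketch. This is my own reading of the algorithm, not the analyzed procedure: it works from whatever sample it is handed and uses a heuristic influence cutoff in place of the exact Chernoff-derived sample size and the Δ/2 threshold, so the constants are assumptions.

```python
import random

def sign(v):
    return 1 if v > 0 else -1

def learn_binary_perceptron(examples, n, eps):
    # examples: list of (x, label), x a +/-1 tuple, label = g(x).
    m = len(examples)
    pos = sum(lab == 1 for _, lab in examples) / m
    if pos >= 1 - eps / 2:                 # step 2: almost always positive
        return [0] * n, -(n + 1)           # threshold -(n+1): g' == +1
    if pos <= eps / 2:                     # step 3: almost always negative
        return [0] * n, n                  # threshold n: g' == -1
    # Steps 5-6: estimate Inf(x_i) = E(g|x_i=+1) - E(g|x_i=-1) and take
    # its sign when clearly nonzero (cutoff is a heuristic, not Delta/2).
    w = []
    for i in range(n):
        s = {1: 0.0, -1: 0.0}
        c = {1: 0, -1: 0}
        for x, lab in examples:
            s[x[i]] += lab
            c[x[i]] += 1
        if min(c.values()) == 0:           # step 5: degenerate variable
            w.append(0)
            continue
        inf = s[1] / c[1] - s[-1] / c[-1]
        cutoff = 4.0 / (m ** 0.5)
        w.append(sign(inf) if abs(inf) > cutoff else 0)
    # Step 7: least threshold r with estimated P(g=+1 | g'=+1) > 1 - eps.
    for r in range(-(n + 1), n + 1):
        agree = [lab for x, lab in examples
                 if sign(sum(wi * xi for wi, xi in zip(w, x)) - r) == 1]
        if agree and sum(l == 1 for l in agree) / len(agree) > 1 - eps:
            return w, r
    return w, n

# Quick check against a random zero-threshold target, uniform examples.
rng = random.Random(0)
n = 9
wt = [rng.choice([-1, 1]) for _ in range(n)]
data = []
for _ in range(8000):
    x = tuple(rng.choice([-1, 1]) for _ in range(n))
    data.append((x, sign(sum(a * b for a, b in zip(wt, x)))))
w, r = learn_binary_perceptron(data, n, eps=0.05)
errs = sum(lab != sign(sum(wi * xi for wi, xi in zip(w, x)) - r)
           for x, lab in data)
assert errs / len(data) <= 0.05
```

On this uniform-distribution check the influence gap is large (of order 0.5) compared to the sampling noise, so the weight signs are recovered and the threshold sweep terminates at an accurate value, as Lemmas 1-3 predict.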
Proof. Using Chernoff bounds (Hagerup and Rüb 1989), one can show that a sample of size m = [160n(n+1)]^2/ε^4 · ln(32n/δ) is sufficient to ensure that

- |P̂(g = a) - P(g = a)| ≤ ε/4 with confidence at least 1 - δ/2;
- |P̂(x_i = a) - P(x_i = a)| ≤ ε/4n with confidence at least 1 - δ/4n;
- |Inf̂(x_i) - Inf(x_i)| ≤ ε/4(n+1) with confidence at least 1 - δ/8n;
- |P̂(g = +1 | g' = +1) - P(g = +1 | g' = +1)| ≤ ε/4 with confidence at least 1 - δ/16n.

Combining all these factors, it is easy to show that the hypothesis g' returned by the algorithm will make an error of at most ε with the target g, with confidence at least 1 - δ. Since it takes m units of time to estimate a conditional probability using a sample of size m, the running time of the algorithm is O(m × n). □
4 Reduction to the Clipped Hebb Rule
The perceptron with binary weights and zero threshold has been extensively studied by many authors (Krauth and Mézard 1989; Kohler et al. 1990; Opper et al. 1990; Venkatesh 1991). All these studies assume a uniform distribution of examples. So we may ask how the algorithm of Figure 1 relates to the learning rules proposed previously. To answer this, let us first rewrite the influence of a variable as

Inf(x_i) = E(g | x_i = +1) - E(g | x_i = -1)

and observe that under the uniform distribution, P(x_i = +1) = P(x_i = -1). Next, we notice that in the algorithm of Figure 1, each weight w_i is basically assigned the sign of Inf̂(x_i). Hence, apart from ε and δ, the algorithm can be summarized by the following rule:

w_i = sgn[Inf̂(x_i)] = sgn[Σ_{ν=1}^m g(x^ν) x_i^ν]   (4.1)

where sgn(x) = +1 when x > 0 and -1 otherwise, and x_i^ν denotes the ith component of the νth training example. Equation 4.1 is simply the well-known clipped Hebb rule (Opper et al. 1990), also called the majority rule in Venkatesh (1991). Since this rule
is just the restriction of the learning algorithm of Figure 1 to uniform distributions, Theorem 1 has the following corollary:
Corollary 1. The clipped Hebb rule PAC learns the class of binary perceptrons with zero threshold under the uniform distribution.

5 Average Case Behavior in the Limit of Infinite n

The bound on the number of examples needed by the algorithm of Figure 1 to achieve a given accuracy with a given confidence is overly pessimistic. In our approach, this overestimate can be traced to the inequalities present in the proofs of Lemmas 2 and 3 and to the use of the Chernoff bounds (Hagerup and Rüb 1989). To obtain the typical behavior of the algorithm, we calculate analytically, for any target perceptron, the average generalization rate (i.e., the learning curve). By generalization rate we mean the generalization ability as a function of the size of the training set m. The central limit theorem will tell us that the average behavior becomes the typical behavior in the limit of infinite n and infinite m with α = m/n kept constant. As is generally the case (Vallet 1989; Opper et al. 1990; Opper and Haussler 1991), we limit ourselves, for the sake of mathematical simplicity, to the case of the uniform distribution and zero threshold. Therefore, we will calculate the average generalization rate of the clipped Hebb rule (hereafter CHR) (equation 4.1) for both noise-free and noisy examples.

5.1 Zero Noise. Let w* = (w_1*, w_2*, ..., w_n*) be the target weight vector and let w = (w_1, w_2, ..., w_n) be the hypothesis weight vector constructed by the CHR from m training examples. The generalization rate G is defined to be the probability that the hypothesis agrees with the target on a random example x chosen according to the uniform distribution. Let us start by defining the following sums of random variables:
X = Σ_{i=1}^n w_i x_i   (5.1)

Y = Σ_{i=1}^n w_i* x_i   (5.2)

The generalization rate is given by

G = P[sgn(X) = sgn(Y)]   (5.3)
  = P[XY > 0]   (5.4)
where we have assumed w.l.o.g. that n is an odd number. Since x is
distributed uniformly, we easily find that

E(X) = E(Y) = 0   (5.5)

Var(X) = Var(Y) = n   (5.6)

E(XY) = Σ_{i=1}^n w_i w_i* = n × ρ   (5.7)

where -1 ≤ ρ ≤ +1 is defined to be the normalized overlap between the target and the hypothesis weight vectors. According to the central limit theorem, in the limit n → ∞, X and Y will be distributed according to a bivariate normal distribution with moments given by equations 5.5, 5.6, and 5.7. Hence, for fixed w* and w, the generalization rate G is given by

G = ∫∫_{xy>0} p(x, y) dx dy

where the joint probability distribution p(x, y) is given by

p(x, y) = [1/(2πn √(1-ρ^2))] exp{-(x^2 - 2ρxy + y^2)/[2n(1-ρ^2)]}

This integral easily evaluates to give

G(ρ) = 1 - (1/π) arccos ρ   (5.8)
So, as n → ∞, the generalization rate depends only on the angle between the target and the hypothesis weight vectors. Now, to average this result over all training samples of size m, we argue that for large n, the distribution of the random variable ρ becomes sharply peaked at its mean. Denoting the average over the training samples by << · >>, this amounts to approximating << G(ρ) >> by G(<< ρ >>) as n → ∞. Using equation 5.7, we can write (for a fixed w*):

<< ρ >> = (1/n) Σ_{i=1}^n w_i* << w_i >>   (5.9)
        = (1/n) Σ_{i=1}^n (2p_i - 1)   (5.10)

where p_i is the probability that w_i* w_i = +1. We introduce the independent random variables τ_i^ν = w_i* x_i^ν and use equation 4.1 to write:

w_i* w_i = sgn[Σ_{ν=1}^m g(x^ν) τ_i^ν]   (5.11)
Let us define the new random variables

η_i^ν = g(x^ν) τ_i^ν   (5.12)

With that, p_i can be written as

p_i = P(Σ_{ν=1}^m η_i^ν > 0)   (5.13)

Let q be the probability that η_i^ν = +1. From equation 5.12, q can be written as a binomial sum over the configurations of the remaining n - 1 inputs; using Stirling's formula and keeping only the leading term in 1/√n, this gives

q = 1/2 + 1/√(2πn)   as n → ∞   (5.14)

Hence, in this limit, each η_i^ν has unit variance and a mean of 2/√(2πn). Since η_i^ν and η_i^ν′ are statistically independent, the central limit theorem tells us that, when m → ∞, the variable

Z = Σ_{ν=1}^m η_i^ν

becomes a gaussian variable with mean μ_Z = m × 2/√(2πn) and variance m. Hence, as m → ∞ with α = m/n kept constant, equation 5.13 becomes

p_i = P(Z > 0)   (5.15)
    = 1/2 + (1/2) erf(√(α/π))   (5.16)

Hence, using equations 5.8, 5.10, and 5.16, we have finally:

<< ρ >> = erf(√(α/π))   (5.17)

<< G >> = 1 - (1/π) arccos[erf(√(α/π))]   (5.18)
This result is independent of the target w*. The average generalization rate and normalized overlap are plotted in Figure 2 and compared with numerical simulations. We see that the
Figure 2: (a) The average generalization rate << G >> and (b) the average normalized overlap << ρ >> as a function of the normalized number of examples α = m/n. Numerical results are shown for n = 50, 100, 500. Each point denotes an average over 50 different training samples, and the error bars denote the standard deviations.

agreement with the theory is excellent, even for moderate values of n. Notice that the agreement is slightly better for << ρ >> than it is for << G >>. This illustrates the difference between << G(ρ) >> and G(<< ρ >>). To compare this average-case analytic result to the bounds given by PAC learning, we use the fact that we can bound erf(z) by an exponential (Abramowitz and Stegun 1972) and thus bound the error rate 1 - << G >> by

1 - << G >> ≤ C e^{-α/2π}   (5.19)

for a constant C of order one.
That is, the error rate decreases exponentially with the number of examples and, on average, a training set of size O[n ln(1/ε)] is sufficient to produce an hypothesis with error rate ε. This is an important improvement over the bound of O[(n/ε)^4 ln(n/δ)] given by our PAC learning analysis. Thus, the CHR is a striking example of a very simple "inconsistent" algorithm that does not always produce hypotheses that agree with all the training examples, but nonetheless produces hypotheses with outstanding generalization ability. Moreover, the exponential convergence outlines the computational advantage of learning binary perceptrons using binary perceptrons. In fact, if one allows real weights, no algorithm can outperform the Bayes optimal algorithm (Opper and Haussler 1991). The latter's error rate improves only algebraically, approximately as 0.44/α. On the other hand, for consistent learning rules that produce perceptrons with binary weights, a phase transition to perfect generalization is known to take place at a critical value of α (Sompolinsky et al. 1990; Gyorgyi 1990). Thus, these rules have a slightly better sample complexity than the CHR. Unfortunately, they are much more computationally expensive (with a running time that generally increases exponentially with the number of inputs n). Since it is an "inconsistent" learning rule, the CHR does not exhibit a phase transition to perfect generalization. We think that the exponential convergence is what remains of the "lost" phase transition. An interesting question is how the CHR behaves when learning binary perceptrons on product distributions. To answer this, we first note that the CHR works by exploiting the correlation between the state of each input variable x_i and the classification label (equation 4.1). Under the uniform distribution, this correlation is positive if w_i* = +1 and negative if w_i* = -1.
This is no longer true for product distributions: one can easily craft malicious product distributions where, for example, this correlation is negative although w_i* = +1. The CHR will be fooled by such distributions because it does not take into account the fact that the settings of the input variables do not occur with the same probability. The algorithm of Figure 1 fixes this problem by taking this fact into consideration, through the conditional probabilities. Finally, it is important to mention that binary perceptrons trained with the CHR on examples generated uniformly will perform well even when tested on examples generated by nonuniform distributions, as long as these distributions are reasonable [for a precise definition of reasonable distributions, see Bartlett and Williamson (1991)].

5.2 Classification Noise. In this section we are interested in the generalization rate when learning from noisy examples. We assume that the classification label of each training example is flipped independently with some probability σ. Since the object of the learning algorithm is to construct an hypothesis w that agrees the most with the underlying target w*, the generalization rate G is defined to be the probability that the hypothesis agrees with the noise-free target on a new random example x. The generalization rate for fixed w and w* is still given by equation 5.8. To calculate the effect of noise on << ρ >>, let us define q′ as the probability that η_i^ν = +1 in the presence of noise, whereas q denotes this probability in the noise-free regime (i.e., equation 5.14). These two probabilities are related by

q′ = q(1 - σ) + (1 - q)σ   (5.20)
   = 1/2 + (1 - 2σ)/√(2πn)   as n → ∞   (5.21)

where we have used equation 5.14 for the last equality. This leads to the following expressions for the normalized overlap and the generalization rate in the presence of noise:

<< ρ >> = erf[(1 - 2σ)√(α/π)]   (5.22)

<< G >> = 1 - (1/π) arccos{erf[(1 - 2σ)√(α/π)]}   (5.23)
One can see that the algorithm is very robust with respect to classification noise: the average generalization rate still converges exponentially to 1 as long as σ < 1/2. The only difference from the noise-free regime is the presence of the prefactor (1 - 2σ). The average generalization rate for different noise levels σ is plotted in Figure 3. We see that the numerical simulations are in excellent agreement with the theoretical curves.

6 Summary
We have proposed a very simple algorithm that PAC learns the class of perceptrons with binary weights and arbitrary threshold under the family of product distributions. The sample complexity of this algorithm is O[(n/ε)^4 ln(n/δ)] and its running time increases only linearly with the sample size. We have shown that this algorithm reduces to the clipped Hebb rule when learning binary perceptrons with zero threshold under the uniform distribution. We have calculated exactly its learning curve in the limit n → ∞, where the average behavior becomes the typical behavior. We have found that the error rate converges exponentially to zero and have thus improved the sample complexity to O[n ln(1/ε)]. The analytic expression of the learning curve is in excellent agreement with the numerical simulations. The algorithm is very robust with respect to random classification noise.
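The learning-curve and noise-robustness claims summarized above can be spot-checked with a short simulation of the clipped Hebb rule (equation 4.1). The sketch below is my own (the helper names, the modest n, and the trial counts are invented); it compares the measured target-hypothesis overlap with the large-n prediction << ρ >> = erf[(1 - 2σ)√(α/π)] of equations 5.17 and 5.22.

```python
import math
import random

def chr_overlap(n, alpha, sigma=0.0, trials=20, seed=0):
    # Train the clipped Hebb rule on m = alpha*n uniform examples whose
    # labels are flipped with probability sigma; return the average
    # normalized overlap rho between hypothesis and target weights.
    rng = random.Random(seed)
    m = int(alpha * n)
    tot = 0.0
    for _ in range(trials):
        wt = [rng.choice([-1, 1]) for _ in range(n)]     # target w*
        h = [0] * n                                      # Hebbian sums
        for _ in range(m):
            x = [rng.choice([-1, 1]) for _ in range(n)]
            g = 1 if sum(a * b for a, b in zip(wt, x)) > 0 else -1
            if rng.random() < sigma:                     # label noise
                g = -g
            for i in range(n):
                h[i] += g * x[i]
        w = [1 if hi > 0 else -1 for hi in h]            # clip to +/-1
        tot += sum(a * b for a, b in zip(wt, w)) / n
    return tot / trials

def predicted_overlap(alpha, sigma=0.0):
    # Equations 5.17 and 5.22: <<rho>> = erf[(1-2*sigma)*sqrt(alpha/pi)]
    return math.erf((1 - 2 * sigma) * math.sqrt(alpha / math.pi))

for alpha, sigma in [(5.0, 0.0), (8.0, 0.2)]:
    emp = chr_overlap(n=75, alpha=alpha, sigma=sigma)
    assert abs(emp - predicted_overlap(alpha, sigma)) < 0.12
```

Even at n = 75 the finite-size simulation tracks the asymptotic formula closely, mirroring the agreement shown in Figures 2 and 3, and the overlap degrades only through the (1 - 2σ) prefactor as the noise level rises toward 1/2.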
Figure 3: The average generalization rate << G >> for different noise levels σ. Numerical results are shown for n = 100. Each point denotes the average over 50 different simulations (i.e., 50 different noisy training sets). The error bars (indicated only for σ = 0.4, for clarity) denote the standard deviations.
Acknowledgments
This work was supported by NSERC grant OGP0122405.

References

Abramowitz, M., and Stegun, I. A. 1972. Handbook of Mathematical Functions. Dover, New York. (Eq. 7.1.13.)
Barkai, E., and Kanter, I. 1991. Storage capacity of a multilayer neural network with binary weights. Europhys. Lett. 14, 107-112.
Bartlett, P. L., and Williamson, R. C. 1991. Investigating the distribution assumptions in the PAC learning model. In Proceedings of the 4th Workshop on Computational Learning Theory, pp. 24-32. Morgan Kaufmann, San Mateo, CA.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36, 929-965.
Gyorgyi, G. 1990. First-order transition to perfect generalization in a neural network with binary synapses. Phys. Rev. A 41, 7097-7100.
Hagerup, T., and Rüb, C. 1989. A guided tour to Chernoff bounds. Info. Proc. Lett. 33, 305-308.
Ibragimov, I. A. 1956. On the composition of unimodal distributions. Theory Prob. Appl. 1, 255-260.
782
Mostefa Golea and Mario Marchand
Kohler, H., Diederich, S., Kinzel, W., and Opper, M. 1990. Learning algorithm for a neural network with binary synapses. Z. Phys. B 78,333-342. Krauth, W.,and Mezard, M. 1989. Storage capacity of memory networks with binary couplings. I. Phys. France 50,3057-3066. MacDonald, D.R. 1979. On local limit theorems for integer-valued random variables. The0y Prob. Statistics Acad. Nauk. 3, 607-614. Meir, R.,and Fontanari,J. F. 1992. Calculation of learning curves for inconsistent algorithms. Phys. Rev. A 92,8874-8884. Opper, M., and Haussler, H. 1991. Generalizationperformance of Bayes optimal classification algorithm for learning a perceptron. Phys. Rev. Lett. 66, 26772680. Opper, M., Kinzel, W., Kleinz, J., and Nehl, R. 1990. On the ability of the optimal perceptron to generalize. I. Phys. A: Math. Gen. 23,L.5814586. Pitt, L., and Valiant, L. G. 1988. Computational limitations on learning from examples. I. ACM 35,965-984. Sompolinsky,H.,Tishby, N., and Seung, H. S. 1990. Learning from examples in large neural networks. Phys. Rev. Lett. 65, 1683-1686. Vallet, F. 1989. The Hebb rule for learning linearly separable Boolean functions: Learning and generalization. Europhys. Lett. 8, 747-751. Valiant, L. G. 1984. A theory of the learnable. Cornm. ACM 27,1134-1142. Venkatesh, S. 1991. On learning binary weights for majority functions. In Proceedings of the 4th Workshop on Computational Learning Theoy, pp. 257-266. Morgan Kaufmann, San Mateo, CA. Received 29 July 1992; accepted 26 January 1993.
Communicated by Eric Baum
Construction of Minimal n-2-n Encoders for Any n
D. S. Phatak, H. Choi, I. Koren
Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003 USA
The encoding problem (Rumelhart and McClelland 1986) is an important canonical problem. It has been widely used as a benchmark. Here, we have analytically derived minimal-sized nets necessary and sufficient to solve encoding problems of arbitrary size. The proofs are constructive: we construct n-2-n encoders and show that two hidden units are also necessary for n > 2. Moreover, the geometric approach employed is general and has much wider applications. For example, this method has also helped us derive lower bounds on the redundancy necessary for achieving complete fault tolerance (Phatak and Koren 1992a,b).

1 Introduction
The encoding problem is an important canonical problem for neural networks (Rumelhart and McClelland 1986). In this problem, a set of orthogonal input patterns is mapped onto a set of orthogonal output patterns through a (small) set of hidden units. Typically, the inputs and outputs are assumed to be binary. There are n input units, n output units, and m hidden units, where m = log₂ n. The hidden units are generally arranged in a single layer, resulting in three layers of units. There are n input/output patterns. The hidden units are expected to form some sort of compact code for each of the patterns. Henceforth, we refer to an encoding problem of size n by the acronym n × n problem, and to a net for a problem of this size that has m hidden units as an n-m-n encoding net. The inputs and outputs of the units are continuous valued. That raises the question: are log₂ n hidden units necessary to solve an n × n problem? If fewer units can do the job, what is the minimum number of units needed for an n × n encoding problem? We have analytically derived this minimum number of hidden units and established the capabilities of n-m-n encoding nets. The next section describes the topology and states the assumptions. Section 3 presents and proves the results on the bounds and related parameters. The following sections present discussion and conclusion.

Neural Computation 5, 783-794 (1993) © 1993 Massachusetts Institute of Technology
Figure 1: An n-m-n encoding net (input layer, n units; hidden layer, m units; output layer, n units).
2 Topology
The network is arranged into three layers as shown in Figure 1. Every unit in a layer feeds all units in the next layer. There are no layer-skipping connections. Besides the incoming weights, each unit (in the hidden and output layers) has one more independently adjustable parameter, that is, a threshold or bias. The units are assumed to be sigmoidal, and the output of the ith unit is given by

output_i = S(resultantinput_i)

where

S(u) = 1/(1 + e^(-u)),   resultantinput_i = netinput_i - bias_i,   and   netinput_i = Σ_{j=1}^{r} w_ij o_j    (2.1)
Here, r is the number of units that feed unit i, and w_ij is the weight of the link from unit j (sender) to unit i (receiver). The output is considered to be on, or at logical level "1", if it is greater than or equal to 0.50, and off, or at level "0", if it is less than 0.50. The input patterns are the rows of the n × n identity matrix. The target outputs are identical to the inputs; that is, the hidden units are expected to simply replicate the input pattern onto the output layer. The hidden layer encodes each of the n patterns with m < n units, and the output layer decodes the compact codes developed by the hidden units back to the original patterns.
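A minimal sketch of the unit computation in equation 2.1 (function and variable names are ours):

```python
import math

def unit_output(weights, inputs, bias):
    """One sigmoidal unit as in equation 2.1: the net input is the
    weighted sum of incoming activities o_j, the bias is subtracted,
    and the result is passed through S(u) = 1/(1 + e^-u)."""
    net = sum(w * o for w, o in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(net - bias)))

# A unit is "on" (logical "1") when its output is >= 0.50, i.e. when
# the net input exceeds the bias.
on = unit_output([2.0, -1.0], [1.0, 0.0], 1.5)    # net 2.0 > bias 1.5
off = unit_output([2.0, -1.0], [0.0, 1.0], 1.5)   # net -1.0 < bias 1.5
```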
Figure 2: A 2-1-2 encoding net with weights and biases w_i and b_i > 0 for all i. Unit indices are shown in parentheses.

3 Results
With the above topology and assumptions, we now proceed to state the following results.

Theorem 1. An encoding net with a single hidden unit (i.e., m = 1) can learn at most the 2 × 2 encoding problem.
Proof. That it can learn the 1 × 1 and 2 × 2 problems can be demonstrated by giving an example. In Figure 2, a 2-1-2 net is illustrated along with all the weights and biases. Unit numbers are shown in parentheses and the bias values are indicated inside the circles representing the units. Units 4 and 5 constitute the input layer, and units 1 and 2 belong to the output layer. It can be verified that w3 = w4 = b1 = b2 = 5.0 and w1 = w2 = 10.0, along with the signs indicated in the figure, lead to correct reproduction of the two input patterns (viz., (1,0) and (0,1)) at the output layer. This is one of the infinitely many sets of weight and bias values that lead to correct outputs.
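The weight magnitudes above can be checked directly. The sign pattern below is one consistent reading of Figure 2, which is not reproduced here, so treat it as our assumption; the hidden unit's bias is also taken as 0.

```python
import math

def S(u):
    return 1.0 / (1.0 + math.exp(-u))

# Magnitudes from the text; the SIGNS are our assumption (they are
# indicated only in Figure 2): the hidden unit gets +w3 from input
# unit 4 and -w4 from input unit 5 (hidden bias taken as 0), output
# unit 1 gets +w1 with bias +b1, output unit 2 gets -w2 with bias -b2.
w1 = w2 = 10.0
w3 = w4 = b1 = b2 = 5.0

def net_2_1_2(pattern):
    x4, x5 = pattern
    h = S(w3 * x4 - w4 * x5)              # the single hidden unit
    y1 = S(w1 * h - b1)                   # output unit 1
    y2 = S(-w2 * h + b2)                  # output unit 2
    return (1 if y1 >= 0.5 else 0, 1 if y2 >= 0.5 else 0)

outputs = [net_2_1_2(p) for p in ((1, 0), (0, 1))]
```

With this sign assignment the hidden unit outputs roughly 0.993 for pattern (1,0) and 0.007 for (0,1), and both patterns are reproduced at the output layer.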
We now prove that it is impossible to reproduce 3 × 3 patterns using only one hidden unit. Here, the hidden unit must have three distinct outputs, one corresponding to each of the three input patterns; otherwise the output units cannot distinguish between those patterns that map onto the same output value of the hidden unit. Denote the three distinct outputs of the hidden unit as o1, o2, and o3, respectively, where without loss of generality o1 > o2 > o3. Let the weights from the hidden unit to the output units be w1, w2, and w3, and the biases of the output units be θ1, θ2, and θ3, respectively. Then, the resultant input to the ith output unit (denoted by y_i) is given by

y_i = w_i x - θ_i,   where i = 1, 2, 3 and x = o1, o2, o3    (3.2)

Here, x denotes the output of the hidden unit. Note that the functions

f_i(x) = S[y_i(x)] = 1/(1 + e^-(w_i x - θ_i)),   where i = 1, 2, 3 and x = o1, o2, o3    (3.3)
are monotonic. Without loss of generality, the input patterns are assumed to be {1,0,0}, {0,1,0}, and {0,0,1}. These same patterns should be reproduced at the output, which implies

f1(o1) = "1", that is, f1(o1) > 0.5;   f1(o2) = "0", that is, f1(o2) < 0.5;   f1(o3) = "0", that is, f1(o3) < 0.5    (3.4)

f2(o1) = "0", that is, f2(o1) < 0.5;   f2(o2) = "1", that is, f2(o2) > 0.5;   f2(o3) = "0", that is, f2(o3) < 0.5    (3.5)

f3(o1) = "0", that is, f3(o1) < 0.5;   f3(o2) = "0", that is, f3(o2) < 0.5;   f3(o3) = "1", that is, f3(o3) > 0.5    (3.6)
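As a quick numerical side check (ours, not part of the proof), one can sample weight-bias pairs at random and verify that the off-on-off pattern demanded of unit 2 by constraints 3.5 never occurs:

```python
import math
import random

def f(x, w, theta):
    # One output unit driven by the scalar hidden output x (equation 3.3).
    return 1.0 / (1.0 + math.exp(-(w * x - theta)))

# Unit 2 would need f(o1) < 0.5, f(o2) > 0.5, f(o3) < 0.5 with
# o1 > o2 > o3: an off-on-off pattern that a monotone function of x
# cannot produce, whatever w and theta are.
o1, o2, o3 = 0.9, 0.5, 0.1
random.seed(1)
violations = 0
for _ in range(100_000):
    w = random.uniform(-50.0, 50.0)
    theta = random.uniform(-50.0, 50.0)
    if f(o1, w, theta) < 0.5 and f(o2, w, theta) > 0.5 and f(o3, w, theta) < 0.5:
        violations += 1
```

The count of violations stays at zero, as the monotonicity argument guarantees.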
From equation 3.3 it is seen that constraints 3.4 and 3.6 can be satisfied, since they obey monotonicity. Constraints 3.5, however, cannot be satisfied, since the function on the left-hand side is monotonic while the required outputs on the right-hand side are not monotonic. It can be verified that for any permutation of input patterns and output values, the constraints on one of the three units are impossible to satisfy, since the inputs to that unit are monotonic but the target outputs are not. Thus the 3 × 3 problem cannot be solved by just one hidden unit. The proof for the n × n sized problem with n > 3 is identical to the above proof for the 3 × 3 case. □

There is a geometric interpretation of the above result that is illustrated in Figure 3. This interpretation is critical for the proof of the next theorem, which establishes a bound for the general n × n problem. For a 2-1-2 net, the output of the hidden unit corresponding to each of the (input) patterns can be represented by a point along one dimension, or a line. Without loss of generality, choose that line to be the x axis. Then,
Figure 3: A geometric interpretation of the 2-1-2 encoding problem.
the output of the hidden unit corresponding to each of the two input patterns is a point in [0,1] on the x axis, as illustrated by points P1 and P2 in Figure 3. Because of the one-to-one mapping from the input patterns to the points representing the outputs of the hidden unit, the symbols P1 and P2 will also be used to refer to the patterns. The resultant input to the ith unit is given by equation 3.2, where i = 1, 2 and w_i and θ_i are the weight and bias associated with the ith unit. Note that these equations represent straight lines (hyperplanes in general) in the x-y plane, as illustrated by lines l1 and l2 in Figure 3. Henceforth, we just use the labels 1 and 2 to refer to the output units as well as the corresponding lines (hyperplanes) implemented by the units. A point x0 is considered to be on the positive side of the line y = wx - θ if wx0 - θ > 0, and on the negative side of the line if wx0 - θ < 0. For example, in Figure 3, all points (on the x axis) to the right of point Q are on the positive side of line l1 and on the negative side of line l2. The vertical distance P1A between point P1 and the line l1 represents the resultant input to output unit 1 for pattern P1. Similarly, distance P1B represents the resultant input to unit 2 for pattern P1. It is useful to think of directed distances from the points P1, P2 to lines l1, l2. If the direction is upward (along the +y axis), then the corresponding resultant input is positive (i.e., the output of the unit is "1"), while a downward distance (along the -y axis) implies a negative resultant input ("0" output). For the patterns (points) on the positive side of the line, the resultant input to the corresponding unit is positive and the unit output is on, or "1." Conversely, a unit is on only if the pattern lies on the positive side of the line it implements. Similarly, a unit is off if and only if the pattern lies on the negative side of the line corresponding to the unit.
Learning implies finding weights and biases that satisfy the constraints

y1(o1) > 0;   y1(o2) < 0;   y2(o1) < 0;   y2(o2) > 0    (3.7)
The first two inequalities say that points P1 and P2 must be on the positive and negative sides of line l1, respectively, because unit 1 should be on for pattern 1 and off for pattern 2. The interpretation of the last two inequalities is similar. Together, the constraints imply that both lines l1 and l2 intersect the x axis between P1 and P2, and that one of them has a positive slope and the other a negative slope. Figure 3 illustrates a case where the points P1, P2 and lines l1, l2 satisfy the above constraints. In this figure, both l1 and l2 intersect the x axis at the same point Q. In general, this need not be the case, as long as the constraints are satisfied. In general, learning implies constraints similar to equation 3.7. The constraints are such that

1. An output unit is on for only one pattern. This means that the weight(s) and bias associated with that unit define a hyperplane that has only one of the points Pi on its positive side; all others are on its negative side.

2. Each point Pi is such that for the corresponding input pattern, only one output unit is on, and this unit stays off for all other input patterns. This means that each of the points Pi is on the positive side of exactly one hyperplane and on the negative side of all others.
In Figure 3, P1 is on the positive side of only one line, viz., l1, and P2 is on the positive side of only one line, viz., l2. Similarly, line l1 has only one point on its positive side, viz., P1, and line l2 has only one point on its positive side, viz., P2. For the n × n encoding problem, it may be expected that the minimum number of hidden units required is a function of n. Contrary to this expectation, however, it turns out that only two hidden units are sufficient to solve any n × n problem for arbitrarily large n.

Theorem 2. Only two hidden units are sufficient to encode and decode n × n patterns for any positive integer n.
Proof. We prove this by a geometric construction similar to the one illustrated above for the 2-1-2 case. Here the network is n-2-n; that is, there are n input units, 2 hidden units, and n output units. For each input pattern, the hidden units develop outputs that can be represented by a distinct point in the x-y plane, where the x coordinate denotes the output of the first hidden unit and the y coordinate denotes the output of the second hidden unit. These points are denoted by P_i, i = 1, 2, ..., n. The hidden units feed all the output units. Let the weight associated with the link between hidden unit 1 and output unit i be denoted by w_i^1. The weight from hidden unit 2 to output unit i is denoted by w_i^2. Let the
bias of the output unit i be denoted by θ_i. Then, the resultant input to the ith output unit (denoted by z_i) is given by

z_i = w_i^1 x + w_i^2 y - θ_i,   where i = 1, ..., n and (x, y) = (o_1^1, o_1^2), ..., (o_n^1, o_n^2)    (3.8)

Here, x and y correspond to the axes or dimensions representing the outputs of the hidden units, and z represents the dimension that corresponds to the resultant input to the output units. These equations represent (hyper)planes in the three-dimensional space that will henceforth be denoted by Π_i, where i = 1, ..., n. These planes are the decision surfaces implemented by the corresponding units. We say that a point (x0, y0) is on the positive side of plane Π_i if

z0 = w_i^1 x0 + w_i^2 y0 - θ_i > 0    (3.9)

and on the negative side if

z0 = w_i^1 x0 + w_i^2 y0 - θ_i < 0    (3.10)
In order to map the input patterns onto the output patterns, the points P_k and the planes Π_i have to satisfy constraints similar to those listed above in the exposition on geometric interpretation. Once again we observe that each plane Π_i defines the output of one of the units in the output layer, and each of the points P_k corresponds to a pattern. An output unit is on only for one of the n patterns and off for the others. Similarly, each pattern has exactly one output unit on and all others off. These constraints can be geometrically interpreted as follows:

1. Each plane Π_i has only one point on its positive side; all other points are on its negative side.

2. Each point P_k is on the positive side of only one plane and on the negative side of all other planes.

If there exist points P_k and planes Π_i, i, k = 1, 2, ..., n, that satisfy the above constraints, then they constitute a valid solution for the n × n problem using only two hidden units. Figure 4 shows the geometric construction that proves the existence of such solution(s). It shows a 6-2-6 case for the purpose of illustration, but the procedure can be applied to any n-2-n problem. As a first step toward the solution of the n-2-n problem, a regular polygon of n sides is constructed in the x-y plane. This is illustrated by the hexagon with vertices (a, b, c, d, e, f) drawn in solid linestyle in Figure 4. Next, every edge is extended beyond the vertex up to a point where it meets the extension of some other edge of the polygon, so that (isosceles) triangles are obtained on the exterior of the original polygon, with the edges of the polygon as the bases of these triangles. This is illustrated by the shaded triangles in Figure 4. Now consider the original polygon
Figure 4: The geometric construction to obtain the weights and biases for a 6-2-6 (or, in general, n-2-n) encoding net.
as the base of a pyramid, or a cross section of the pyramid along the x-y plane. The faces of the pyramid intersect at a point directly (vertically) below (along the -z direction) the center of the circumcircle of the polygon. In Figure 4, for example, the center of the circumcircle is labeled V. The vertex of the hexagonal pyramid lies directly (vertically) below the point V (i.e., on a line in the -z direction, directed into the page from point V). The n faces of the pyramid define the n planes Π_i. The points P_k have to be located within the isosceles triangles on the exterior of the polygon in the x-y plane, in order to satisfy the two constraints mentioned above. One point is placed inside each triangle, as illustrated by points P1, ..., P6 inside the shaded triangles in Figure 4. With this construction, each plane Π_i is such that only one point is on
its positive side and all other points are on its negative side. For example, in Figure 4, the plane Π1 passing through the vertex of the pyramid and edge ab is such that only one point, viz., P1, is on its positive side, while all others are on its negative side. Similarly, each point is on the positive side of exactly one plane and on the negative side of all the others. In Figure 4, for example, point P2 is on the positive side of plane Π2 only, and is on the negative side of all the other planes. Thus the points and planes satisfy all the above constraints and represent a valid solution. The outputs of all the units have to be in [0,1]. This means that the entire diagram should be within the unit square in the x-y plane, which is bounded by the vertices (0,0), (0,1), (1,0), and (1,1). This is always possible to arrange, since the polygon can be shrunk to any desired size so that the entire diagram fits inside the unit square. This proves that a solution (in fact, infinitely many of them) always exists for the n-2-n problem and can be obtained by the above construction. □

4 Discussion
The above results hold for the complementary encoding problem (0s and 1s are interchanged) as well. For a complementary encoding problem, the vertex of the pyramid in the above construction lies directly (vertically) above the circumcenter V, which is in the x-y plane. Also note that the I/O patterns for the complementary encoding problem are not mutually orthogonal. In the above construction, the points corresponding to the outputs of the hidden units must lie within the triangles formed on the edges of the polygon. Hence the area of the triangles is, in a crude sense, related to the probability of finding a valid solution: the larger the area, the higher the probability that gradient descent will latch on to a valid solution. Note that the outputs of the hidden units are confined between two circles, viz., an inner circle that touches (is tangent to) each edge of the polygon and an outer circle that passes through the tips of all the triangles on the exterior of the polygon. Both these circles are drawn in dotted linestyle in Figure 4. For a given n, the triangles have the largest area when the outer circle is as large as possible, that is, when it touches the edges of the unit square in the x-y plane. Hence the net is more likely to hit on this solution. This is consistent with the observation that neural nets tend to stabilize at vertices or corners of the solution space. As n → ∞, the circles approach each other, and in the limit they coincide. This means that the volume (area in this case) of the solution space approaches 0 and, therefore, the probability that the search algorithm converges to a valid solution also approaches 0, as expected. The distance (along the z direction) between the point P_i and the corresponding plane Π_i represents the resultant input to a unit. In the limit as n → ∞, the points P_i approach the planes Π_i, and the vertical distance
between the planes and the points approaches 0 as well. This means that the resultant inputs to the output units approach 0. Hence the outputs of units that are on approach 0.5 from above, that is, output values indicating a logical level "1" → 0.5+, and the outputs of the units that are off approach the limit 0.5 from the other side, that is, logical "0" → 0.5-. If the output tolerances are specified (for example, a "1" cannot be below 0.75 and a "0" cannot be above 0.25), then, in the above construction, it is possible to find the maximum value of n that will deliver the outputs within the desired tolerances, for a given m. Conversely, given an n, the number of hidden units, m, required to deliver the outputs within the specified tolerance can also be calculated from the above construction. If n ≤ 4, the "allowable" regions for the points P_i are no longer triangles, since the edges of a regular polygon with n ≤ 4 sides, when extended beyond the vertices, do not intersect the extensions of any of the other edges. It should also be noted that in the above construction, the polygon need not be regular. If the polygon is not regular, however, some of the "allowable" areas shrink and the others expand. Also, the planes Π_i need not intersect at the same point or form a pyramid, as long as the relative placement of the planes and the points satisfies the two constraints mentioned above. The unbounded allowable areas for the points P_i that arise when n ≤ 4 or when the underlying polygon is irregular, as well as the asymmetry in allowable areas that arises when the polygon is irregular, are illustrated in Figure 5. Note that the construction remains the same in all these cases. The points P_i still have to be in the regions exterior to the polygon, and between the lines obtained by extending the edges of the polygon beyond the vertices. This is illustrated by the shaded regions in Figure 5.
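Returning to the regular construction of Theorem 2, it can be turned into explicit weights and exercised numerically. The sketch below is ours, not the paper's code: all names, sizes, and the sigmoid gain are our choices, and hidden biases are taken as 0. For pattern k the two hidden units output the coordinates of P_k, and output unit i fires only when the point lies beyond edge i of the polygon.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def S(u):
    return 1.0 / (1.0 + math.exp(-u))

def build_n_2_n(n, apothem=0.2, bulge=0.05, gain=40.0):
    """Explicit weights for an n-2-n encoder from the polygon/pyramid
    construction.  Parameter names and sizes are our choices.  Point
    P_k sits a distance `bulge` beyond the midpoint of the k-th polygon
    edge (inside the k-th exterior triangle for the n used here);
    output plane k is the pyramid face over that edge, scaled by `gain`."""
    cx, cy = 0.5, 0.5                     # centre of the unit square
    normals = [(math.cos(2.0 * math.pi * k / n),
                math.sin(2.0 * math.pi * k / n)) for k in range(n)]
    points = [(cx + (apothem + bulge) * nx,
               cy + (apothem + bulge) * ny) for nx, ny in normals]
    # With identity input patterns and zero hidden biases, hidden unit j
    # outputs coordinate j of P_k for pattern k if weight jk = logit(coord).
    hidden_w = [[logit(p[j]) for p in points] for j in range(2)]
    # Output unit i implements plane Pi_i: z = gain*(n_i.(p - c) - apothem),
    # which is positive only for points beyond edge i of the polygon.
    out_w = [(gain * nx, gain * ny) for nx, ny in normals]
    out_b = [gain * (nx * cx + ny * cy + apothem) for nx, ny in normals]
    return hidden_w, out_w, out_b

def recall(n, net):
    hidden_w, out_w, out_b = net
    result = []
    for k in range(n):                    # present the k-th identity pattern
        x = S(hidden_w[0][k])             # hidden unit 1
        y = S(hidden_w[1][k])             # hidden unit 2
        outs = [S(wx * x + wy * y - b) for (wx, wy), b in zip(out_w, out_b)]
        result.append([1 if o >= 0.5 else 0 for o in outs])
    return result

recalled = recall(6, build_n_2_n(6))      # the 6-2-6 case of Figure 4
```

Note that `bulge` must shrink as n grows, exactly the vanishing solution area discussed above: P_k is rejected by every other plane only when (apothem + bulge)·cos(2π/n) < apothem.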
If the quadrilateral shown in Figure 5 were regular (i.e., a square), then all the "allowable" regions for the points P_i would be identical in shape and unbounded on one side. Because the quadrilateral is irregular, some allowable regions have shrunk and others have grown. For example, the shaded region to the left of plane Π2 has shrunk from a rectangular strip unbounded on the left side to the bounded, triangular region shown in the figure. Similarly, the shaded region to the right of Π1 has expanded from a rectangular strip to an unbounded quadrilateral. It seems that the symmetric solution is more fault tolerant. The reasoning is as follows. The edges and planes of the polygon can be jiggled without changing the classification or logical output of the network. This corresponds to changing the weights and biases of the units represented by the planes. How much change is allowed in the weight and bias values depends on n and other factors. For the symmetric solution, it is evident that whatever tolerance applies to one point or plane also applies to all the other points or planes. In contrast, if the polygon is not regular or if the planes do not form a pyramid, then some points and planes must
Figure 5: The construction for the case when n ≤ 4 and the polygon is not regular.
be confined to smaller tolerances (smaller than the corresponding ones in the symmetric case), while the others can have larger tolerances. The total amount of deviation allowed can be measured by the volume enclosed between the original positions of the planes and the extreme positions after large deviations in parameters (or faults) at which the solution (the relative placement of planes and points) still satisfies the above constraints. It is conjectured that the total of such "fault-tolerance volumes" is maximal for the symmetric case; in other words, a symmetric solution is more fault tolerant.

5 Conclusion
Bounds have been established for the solution of the encoding problem using a feedforward network with one layer of hidden units. Existence of solution(s) is demonstrated by constructive proofs, leading to the actual solutions. The discussion reveals interesting connections to limiting cases, fault tolerance, the probability of finding a valid solution, and other
issues. The geometric interpretation is general and applicable to other problems as well. For instance, this approach was employed in Phatak and Koren (1992a,b) to derive lower bounds on the redundancy necessary to achieve complete fault tolerance for all single faults. The encoding problem directly reflects on the ability of the net to develop distributed representations among the hidden units and map them back onto localized representations on the output units. These results will possibly help to define a meaningful measure of the distributedness of representations.

References

Phatak, D. S., and Koren, I. 1992a. Fault tolerance of feedforward neural nets for classification tasks. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), Vol. II, pp. II-386-II-391. Baltimore, MD.
Phatak, D. S., and Koren, I. 1992b. Complete and partial fault tolerance of feedforward neural nets. Tech. Rep. TR-92-CSE-26, Electrical and Computer Engineering Department, University of Massachusetts, Amherst.
Rumelhart, D. E., and McClelland, J. L. 1986. Parallel Distributed Processing, Vol. 1: Foundations. MIT Press, Cambridge, MA.

Received 21 September 1992; accepted 12 January 1993.
Communicated by Steven Nowlan
Recurrent and Feedforward Polynomial Modeling of Coupled Time Series
Vicente López, Ramón Huerta, José R. Dorronsoro
Instituto de Ingeniería del Conocimiento, Universidad Autónoma de Madrid, 28049 Madrid, Spain
We present two methods for the prediction of coupled time series. The first is based on modeling the series by a dynamic system with a polynomial format. This method can be formulated in terms of learning in a recurrent network, for which we give a computationally effective algorithm. The second method is a purely feedforward D-T network procedure whose architecture derives from the recurrence relations for the derivatives of the trajectories of a Ricatti-format dynamic system. It can also be used for the modeling of discrete series in terms of nonlinear mappings. Both methods have been tested successfully against chaotic series.

1 Introduction
In this paper we will consider the problem of predicting the future evolution of a certain D-dimensional vector f = (f_1, ..., f_D), knowing its past behavior. More precisely, it is assumed that f depends continuously on a given one-dimensional variable t, often taken as time, and that a certain number K of past samples f_{-i} = f(t_{-i}), i = 1, ..., K, are known up to time t_0, usually at equally spaced times, i.e., t_{-i} = t_0 - iτ. It is then desired to forecast the future values of f at times t_j = t_0 + jτ, j = 1, ..., L. As stated, this problem has been widely studied and several approaches have been proposed for its solution (see for instance Box and Jenkins 1970; Gabor et al. 1960; Farmer and Sidorowich 1987). Of particular interest here are the so-called dynamic systems methods, which in general consider the trajectory vector f as the evolution of a dynamic system, that is, as the solution of an ordinary differential equation (ODE)

ẋ = F(x, w)    (1.1)
where F = (F_1, ..., F_D) denotes a D-dimensional function of D variables; obviously, a coupling relationship is assumed among the D individual components of x. The parameters w determine a specific realization

Neural Computation 5, 795-811 (1993) © 1993 Massachusetts Institute of Technology
within the functional approximation model given by F, and they are to be adjusted, usually by least-squares minimization of the distance between the known past trajectory of f and a particular solution x of equation 1.1. This point of view has recently received great attention, either as stated (Eisenhammer et al. 1991) or recast as the problem of obtaining a mapping giving certain sections of the dynamic system evolution (Crutchfield and McNamara 1987). The time series prediction problem has also aroused considerable interest in the neural network research community. In particular, the training procedures for the well-known continuous Hopfield models (Hopfield 1982) require the target-driven parameter fitting of ODE systems written in the Hopfield format. Although in principle this model building was done to obtain adequate mappings between static inputs and outputs, the training target values were essentially taken as constant trajectories that the underlying dynamic system tried to match. This point of view readily suggests that the same training methods can be used to obtain ODE-based recurrent networks capable of learning certain state space trajectories (Pineda 1987; Pearlmutter 1989; Williams and Zipser 1989; Toomariam and Barhen 1991). In these methods the training set contains a number of coupled trajectories, whose step i targets are clocked back to the network as step i + 1 inputs. These networks are then indeed capable of learning certain state space trajectories; on the other hand, their modeling scope seems to be limited by the somewhat restricted nature of the ODE format employed. It is clear that the predictive power of a model such as equation 1.1 will strongly depend on the ability of the chosen functional model to approximate adequately general D-dimensional functions. Here (see also López and Dorronsoro 1991) we will use as a functional model a polynomial function of fixed degree.
More precisely, we will replace each of the D components of the general multivalued function F in equation 1.1 by a polynomial function P on D variables with degree K, that is,

ẋ_i = w^i + Σ_{j1} w^i_{j1} x_{j1} + Σ_{j1,j2} w^i_{j1 j2} x_{j1} x_{j2} + ...
Other choices are possible (see, e.g., Cremers and Hubler 1987), but this format, sometimes called the universal polynomial format, is a natural one from various points of view. First, it can be viewed as a Kth-order Taylor approximation to the general function F in equation 1.1, giving for reasonable F and high enough K a good numerical approximation. It also encompasses a rich family of possible differential equations (Kerner 1981). On the other hand, for D-dimensional trajectories, essentially D^(K+1) free coefficients have to be adjusted, which for large values of D sharply limits the allowable choices of K. As a compromise we will concentrate
here on the case K = 2, that is, the so-called Ricatti format,

ẋ_i = a_i + Σ_j b_ij x_j + Σ_{j,k} c_ijk x_j x_k    (1.2)
for which the exact number of free coefficients is then D²(D + 1)/2 + D² + D = D³/2 + O(D²). Although restricted, this format also turns out to be quite rich; for instance, a number of systems with chaotic dynamics fall into its scope, such as the Lorenz, Rössler, or Hénon-Heiles equations (Lichtenberger and Lieberman 1983). Also, we have found that the addition to the original trajectories of a few of their own successive powers enables a Ricatti model to yield a good approximation to a general analytical F. We will study in Section 2 the target-driven Ricatti modeling of a given set of coupled time series, which, as mentioned before, can be interpreted in terms of learning in a recurrent neural network. In Section 3 we will propose a purely feedforward alternative approach in terms of a network. Its particular architecture will be derived from the consideration of partial Taylor series summation for the integration of the Ricatti ODEs. In Section 4 we will illustrate numerically both approaches for the modeling of ODE-generated time series derived from the so-called Morse equations of molecular dynamics and from the very well known Lorenz equations. We will also consider the applicability of these techniques to series generated from discrete mappings, using as an example the well-known Hénon map. Finally, in Section 5 we will summarize the main results of the paper.
2 Ricatti Recurrent Modeling
We assume that the known coupled trajectories of f are given as a certain function of time f(t) for values of t on an interval [t_{-K}, t_0], and that predictions are desired for the values of f on the interval [t_0, t_L]. We want to determine a set of parameters w_0 such that the trajectory x_0 = \Phi(t, w_0), obtained by integrating equation 1.1 for these values of w and initial conditions x(t_{-K}) = f(t_{-K}), minimizes a certain error measuring function E(f, x), which in our examples will be

E(f, x) = \frac{1}{t_0 - t_{-K}} \int_{t_{-K}}^{t_0} \sum_{i=1}^{D} [f_i(t) - x_i(t)]^2 \, dt      (2.1)
The most effective numerical procedures to find the minimizing w_0 require gradient information. Although for the E above several methods can be devised, there is an efficient gradient computation procedure that uses the constraints on x imposed by equation 1.1. It was first proposed by Sato (1990a) as a teaching algorithm for recurrent Hopfield networks and then used in Sato (1990b) for the learning of spatiotemporal patterns by the same networks. In any case, the method can also be applied to the
Vicente López, Ramón Huerta, and José R. Dorronsoro
present situation. The main idea is to add an extra, 0-valued term to the definition of E above, namely,

E'(f, x) = E(f, x) + \int_{t_{-K}}^{t_0} \sum_i z_i(t) \Big[ \dot{x}_i - a_i - \sum_j b_{ij} x_j - \sum_{j,k} c_{ijk} x_j x_k \Big] \, dt

where z = (z_1, ..., z_D) is an auxiliary set of trajectories, which is to be chosen to ease the gradient computation. For a Ricatti format system (equation 1.2) these differential equations for z are found to be

\dot{z}_i = -\frac{2}{t_0 - t_{-K}} [f_i - x_i] - \sum_j z_j \Big[ b_{ji} + \sum_k (c_{jik} + c_{jki}) x_k \Big]      (2.2)

with end point conditions z_i(t_0) = 0. The gradient is then computed by the formulas

\frac{\partial E}{\partial a_i} = -\int_{t_{-K}}^{t_0} z_i \, dt, \quad
\frac{\partial E}{\partial b_{ij}} = -\int_{t_{-K}}^{t_0} z_i x_j \, dt, \quad
\frac{\partial E}{\partial c_{ijk}} = -\int_{t_{-K}}^{t_0} z_i x_j x_k \, dt      (2.3)
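A minimal numerical sketch of this forward-backward gradient computation, for a scalar (D = 1) Ricatti model with simple Euler integration (function names, grid, and step sizes are ours and purely illustrative):

```python
import numpy as np

def riccati_adjoint_grad(w, f, t):
    """E = (1/T) * integral of (f - x)^2 for x' = a + b x + c x^2, plus its
    gradient with respect to w = (a, b, c), via the forward-backward sweep."""
    a, b, c = w
    n = len(t)
    dt = t[1] - t[0]
    T = t[-1] - t[0]
    # forward pass: Euler integration of the model from x(t_{-K}) = f[0]
    x = np.empty(n)
    x[0] = f[0]
    for k in range(n - 1):
        x[k + 1] = x[k] + dt * (a + b * x[k] + c * x[k] ** 2)
    E = dt * np.sum((f - x) ** 2) / T
    # backward pass: z' = -(2/T)(f - x) - z (b + 2 c x), with z(t_0) = 0
    z = np.empty(n)
    z[-1] = 0.0
    for k in range(n - 1, 0, -1):
        dz = -(2.0 / T) * (f[k] - x[k]) - z[k] * (b + 2.0 * c * x[k])
        z[k - 1] = z[k] - dt * dz
    # gradient integrals: dE/da, dE/db, dE/dc
    grad = -dt * np.array([np.sum(z), np.sum(z * x), np.sum(z * x ** 2)])
    return E, grad
```

Running the routine twice with perturbed parameters gives a finite-difference check of the gradient; for D trajectories the three integrals become the roughly D^3/2 integrals referred to in the text.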
The computational benefits of this approach are now clear: we just integrate two systems, equations 1.2 and 2.2, and then perform the D^3/2 integrals of equation 2.3. The cost of obtaining the gradient in this fashion is then essentially of the order of magnitude of D^3.¹ We conclude this section with some considerations on the numerical implementation of the above algorithm. First, it must be noted that the integration of the systems 1.2 and 2.2 is usually numerically unstable. This is particularly so in the early phases of the minimization process, which give rise to large gradients and, therefore, to abruptly changing equations. The breakup of moderate to large intervals [t_{-K}, t_0] into smaller ones is thus needed to avoid overflow problems. There is another, deeper reason for this breakup. In our numerical simulations we always started with small, uniformly distributed initial values for w. This often gives as first approximations to F highly dissipative systems. Their solutions evolve rapidly to constant values, usually the mean values of F on the training interval and, therefore, independent of w. Hence, even for relatively small intervals, the procedures easily tend to fall into local minima far away from an appropriate F. Thus, to avoid this constant, 0-valued behavior of ∂E/∂w over most of the training interval, it is again necessary to use small subintervals where the dissipation does not set in completely. Finally, notice that the integration of equation 2.2 has to be performed backward in time, which requires the memory storage of the x_i. Therefore, minimizing this storage adds a third reason to break up the training trajectory. In our simulations the breakup was done as follows.
We will assume that the past behavior of f is only known as a finite number K of discrete,

¹In contrast, the more straightforward gradient computing procedure of taking partials with respect to w in equation 2.1 and integrating the ODEs for ∂x_i/∂w derived from equations 1.2 requires on the order of D^6 operations.
equally spaced samples f_{-k} = f(t_{-k}), k = 1, ..., K. We will thus divide the training interval [t_{-K}, t_0] into K subintervals [t_0 - k\tau, t_0 - (k-1)\tau], k = 1, ..., K. We will also replace the continuous mean sum of squares error (equation 2.1) by its discrete counterpart

E = \frac{1}{K} \sum_{k=0}^{K-1} \sum_{i=1}^{D} [f_i(t_{-k}) - x_i(t_{-k})]^2
Moreover, the numerical ODE integration will also proceed in a discrete fashion, stepping forward from time t_{-k} to time t_{-k+1} to obtain x_i(t_0 - (k-1)\tau) from f_i(t_0 - k\tau), and backward to obtain z_i(t_0 - k\tau) from z_i(t_0 - (k-1)\tau) = 0. In the following section, the analysis of this concrete algorithmic implementation will lead us to an alternative approach to the above recurrent modeling, based on feedforward sigma-pi networks.

3 Feedforward Modeling
Partial Taylor series summation provides a natural integrating device for Ricatti ODEs (Fairén et al. 1988) because of the simple recurrence for computing the successive derivatives of x. Each x_i(t_0 - (k-1)\tau) can be approximated to order Q as

x_i(t_0 - (k-1)\tau) \simeq \sum_{q=0}^{Q} \frac{\tau^q}{q!} x_i^{(q)}(t_0 - k\tau)      (3.1)
starting with x_i^{(0)}(t_0 - k\tau) = f_i(t_0 - k\tau); here x_i^{(q)} stands for the qth derivative of x_i, which can be recurrently evaluated by means of the formulas

x_i^{(q)}(t) = a_i \delta_{1q} + \sum_j b_{ij} x_j^{(q-1)}(t) + \sum_{j,k} c_{ijk} \sum_{l+m=q-1} \binom{q-1}{l} x_j^{(l)}(t) \, x_k^{(m)}(t)      (3.2)
with \delta_{1q} being 1 when q = 1 and 0 otherwise. Now, observe that if we conceptually place for each q the x_i^{(q)} in a single layer, the computation in equation 3.2 corresponds for each i to the weighted sum of linear outputs coming from all the units in the previous (q-1)th layer, and quadratic outputs from pairs j, k of units in layers l, m such that l + m = q - 1 (for q = 1 there is an extra term that can be seen as coming from an outside unit with constant 1 activation). Moreover, the outcome of equation 3.1 is for each i the sum of the activations of the i units in the just defined Q + 1 layers, each one constantly weighted by \tau^q/q!. In other words, the outcome of a Qth-order Taylor series integrator as given by equations 3.1
and 3.2 can be formally seen as the output of a sigma-pi feedforward network with

1. one input layer, corresponding to 0 order derivatives;
2. Q intermediate layers, corresponding to order q derivatives, 1 ≤ q ≤ Q, with linear connections to all the units in the preceding (q-1)th layer and quadratic connections from pairs of units in layers l, m such that l + m = q - 1, with connecting weights constrained as in equation 3.2;

3. one output layer whose ith unit is connected to the ith units of all previous qth layers with fixed weights \tau^q/q!.

Figure 1 contains the connections diagram of a network performing first-order Taylor integration of a two-trajectory system. Once these outputs x_i(t_0 - k\tau + \tau) have been computed, the interval errors E_k can be approximated as before by

E_k = \sum_{i=1}^{D} [f_i(k-1) - x_i(k-1)]^2
where we have used the notation x_i(t_0 - k\tau) = x_i(k), f_i(t_0 - k\tau) = f_i(k). This form of the error is readily identified as the usual backpropagation total sum of squares error between the targets f(k) and outputs x(k) corresponding to inputs f(k-1). As is now clear, this concrete numerical implementation lends itself to an alternative modeling device: we simply consider all the weights connecting the first Q + 1 layers in the above network as free parameters (of course, unit activations can no longer be taken as derivatives) and view the network output as given by a particularly structured feedforward network. Learning is then performed by means of the usual, well-known backpropagation algorithm, either on line or in batch mode. This approach clearly offers greater speed in the gradient computations, although with a drawback, due to the fact that the number of weights to be adjusted now depends not only on the dimension D of the coupled trajectories but also on the order Q of the Taylor series to be used: if great accuracy is needed, Q will have to be increased, and so will the dimension of the weight search space. In principle, this could make these networks vulnerable to the well-known phenomenon of overfitting: the perfect memorization of the training patterns but the inability to deal with other inputs not seen before. Nevertheless, the value of the time series networks presented here depends essentially on their prediction abilities: if they are capable of providing adequate medium range forecasts, we can conclude with a certain amount of confidence that they are not hampered by an excessive number of weights. We will illustrate this issue in our numerical examples: relatively high values of Q were needed for them
Figure 1: Connections diagram for first-order Taylor integration of a two-trajectory system (an extra unit with constant activation 1 and connected to the intermediate layer has not been depicted). Circles with a + represent additive units; dotted ones, multiplicative units. Solid circles denote weight multiplication (a few of the corresponding coefficient labels are shown); four of these weights have fixed values 1 and \tau. This network is used in the Hénon trajectory simulation of Section 4.

to successfully model the training trajectories, yet the resulting networks still gave good predictions. A second advantage of these feedforward networks is that, unlike their recurrent counterparts, they can also be used successfully to study discrete time series even when a continuous time evolution cannot be assumed. In this situation an ODE based approach is bound to fail, but these series can sometimes still be studied in terms of mappings, that is, the (k+1)th term x_{k+1} being generated as \Phi(x_k), with \Phi the mapping function we want to model. Used with a time step of 1, the feedforward networks just mentioned provide precisely a particular mapping \hat{\Phi} that from input x_k tries to adjust the step k+1 target x_{k+1}. This target is then clocked back to the network as input and will be mapped by \hat{\Phi} into an approximation to the next target, and so on. In particular, an order 1 feedforward network defines a (nonhomogeneous) quadratic mapping in
terms of the components of x_k. We will also illustrate this approach in the next section. Finally, and although we will not pursue it any further here, there is a third, potentially very important advantage of this feedforward alternative. In contrast with the conjugate gradient or quasi-Newton methods for numerical optimization, backpropagation networks lend themselves naturally to hardware implementations; this is obviously all the more true for the above sigma-pi architecture, since it involves only additions and multiplications.

4 Numerical Results
We will now apply the above techniques to three numerical examples of coupled time series derived from the so-called Morse equations, the well-known Lorenz equations, and the Hénon mapping.
4.1 Morse Trajectories. The first example involves the Morse equations of molecular physics. Although we will not discuss them in detail, we point out that they arise as equations of motion of model Hamiltonians used in many studies of molecular oscillations. Here we will consider the two-trajectory system
\dot{q} = p
\dot{p} = -2(1 - e^{-q}) e^{-q}
where q denotes bond length; notice that the system is very far from being of Ricatti type. The trajectory behavior depends on its total energy H. In our example, if the total energy H is << 1, the solutions oscillate almost harmonically; however, when H is near 1, the oscillations approach a spine-like separatrix located at H = 1 (Fairén et al. 1988), beyond which they stop being periodic. As may be expected, the trajectories near the separatrix are very sensitive to numerical errors. As an a priori measure of the number of powers of q needed to reconstruct a given trajectory, we can consider the potential V(q) that governs the behavior of the Morse trajectories, as depicted in Figure 2, which shows that the value V = 0 divides the bounded and unbounded trajectories. Figure 2 also depicts a sixth-order approximation to V; the three dotted lines correspond, from top to bottom, to the values 0.95, 0.8, and 0.6 of H. From that figure it would appear that this Taylor approximation would be barely adequate in the H = 0.6 case but would fail when H = 0.8 and more so when H = 0.95. However, as we shall see, for this last H = 0.95 case, using as training trajectories the values of r = (q, p, q^2) on 200 points sampled every 0.1 time units (slightly over one period), we still obtain a good reconstruction. In the recurrent case we started all our experiments with small (absolute values less than 0.001) initial coefficients uniformly distributed
Figure 2: Morse potential (continuous drawing) and sixth-order approximation (dots).

Table 1: Recurrent and Feedforward Modeling of the H = 0.95 Trajectory.^a

                           Number of steps
Model           80           160          240          320          400
RM   E/L   1.2 x 10^-4  1.3 x 10^-4  2.6 x 10^-3  2.7 x 10^-3  6.0 x 10^-3
     R^2     1.0000       1.0000       0.9998       1.0000       0.9997
FM   E/L   1.9 x 10^-5  6.3 x 10^-5  4.8 x 10^-5  6.5 x 10^-5  9.8 x 10^-4
     R^2     1.0000       1.0000       1.0000       0.9999       0.9999

^a RM = recurrent model; FM = feedforward model.
around 0, and proceeded to minimize the error function using the quasi-Newton routines given in Press et al. (1986). The procedures took on average about 75 iterations to converge to parameter values yielding in all cases a global error of the order of 5 x 10^-7. The corresponding parameter sets were in fact quite similar: the equations corresponding to q and q^2 were always modeled as essentially \dot{q} = p and \dot{(q^2)} = 2qp, with the other coefficients having much smaller absolute values. The equation for p was essentially
\dot{p} = 2 - 2q - p^2 + 0.8q^2 - 0.15q^3 + 0.01q^4

the remaining terms having similarly negligible absolute values.
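The Morse training series themselves are easy to generate. Assuming the standard Morse potential V(q) = (1 - e^{-q})^2, which yields the force law quoted above and a separatrix at H = 1, a fixed-step RK4 sketch (step size and names are our own choices):

```python
import numpy as np

def morse_rhs(q, p):
    # q' = p, p' = -dV/dq with V(q) = (1 - exp(-q))^2
    return p, -2.0 * (1.0 - np.exp(-q)) * np.exp(-q)

def integrate_morse(q0, p0, h=0.01, n=2000):
    """Fixed-step RK4 integration; returns arrays of q and p samples."""
    qs, ps = [q0], [p0]
    q, p = q0, p0
    for _ in range(n):
        k1q, k1p = morse_rhs(q, p)
        k2q, k2p = morse_rhs(q + 0.5 * h * k1q, p + 0.5 * h * k1p)
        k3q, k3p = morse_rhs(q + 0.5 * h * k2q, p + 0.5 * h * k2p)
        k4q, k4p = morse_rhs(q + h * k3q, p + h * k3p)
        q += (h / 6.0) * (k1q + 2 * k2q + 2 * k3q + k4q)
        p += (h / 6.0) * (k1p + 2 * k2p + 2 * k3p + k4p)
        qs.append(q)
        ps.append(p)
    return np.array(qs), np.array(ps)

def energy(q, p):
    return 0.5 * p ** 2 + (1.0 - np.exp(-q)) ** 2
```

Starting from q = 0 with p = sqrt(2H) gives a trajectory of energy H; sampling (q, p, q^2) every 0.1 time units reproduces the kind of training set used above, and conservation of H is a convenient accuracy check.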
To test the model quality we then proceeded to integrate the resulting Ricatti models over approximately two periods (400 points), starting at a trajectory point. A natural measure of the prediction error is given in this case by the squared error E
E = \int [(q - \tilde{q})^2 + (p - \tilde{p})^2 + (q^2 - \widetilde{q^2})^2] \, dt
that is, the distance between the correct values of q, p, and q^2 and their corresponding integration values. Table 1 shows for an increasing number of steps the value of E/L, that is, this squared error divided by L, the length of the corresponding (q, p, q^2) subtrajectory, given for a time interval [T_0, T_L] as

L = \int_{T_0}^{T_L} (\dot{q}^2 + \dot{p}^2 + 4 q^2 \dot{q}^2)^{1/2} \, dt

It also shows for each interval the empirical values of the quantity R^2 = 1 - Var(\epsilon)/Var(r), with Var(\epsilon) denoting the residual variance and Var(r) the trajectory variance.
R^2 is often used to assess the modeling abilities of a given procedure; values of R^2 near 1 usually indicate good agreement between real and predicted trajectories, with small values giving evidence of poor fitting. As can be seen from the table values, the agreement between the correct and modeled values is rather good. Figure 3 presents the correct 400 step q-p trajectory and its approximation. We also applied the feedforward alternative to this trajectory, using a network mimicking a fourth-order Taylor expansion. Starting at the same trajectory point as above, we allowed the resulting model network to provide a 400 point forecast, clocking the successive network outputs back as the next inputs. Table 1 also gives the error and R^2 evolution; again, the model fitting can be considered quite satisfactory.

4.2 Lorenz Trajectories. These very well known trajectories are obtained by integrating three-dimensional systems of the general form
\dot{x} = -\sigma(x - y)
\dot{y} = bx - y - xz      (4.1)
\dot{z} = -cz + xy
Our examples are obtained for the values \sigma = 10, b = 28, and c = 8/3;
the resulting trajectories present turbulent dynamics (Bender and Orszag 1978). As is readily seen, the Lorenz equations are already in Ricatti format. This would seem to imply that they would be easier to model using our procedures and, therefore, that the same would be true for trajectory
Figure 3: H = 0.95 Morse trajectory (continuous drawing), and recurrent (circles) and feedforward (dots) approximations.

Table 2: Recurrent and Feedforward Modeling of the Lorenz Trajectories.

                              Number of steps
Model                  20       40       60       80       100
Recurrent     E/L    0.004    0.31     1.70     2.96     20.67
model         R^2    1.0000   0.9927   0.9591   0.9277   0.4967
Feedforward   E/L    0.0014   0.26     1.20     2.22     20.13
model         R^2    0.9999   0.9938   0.9711   0.9456   0.5096
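Lorenz series of the kind modeled here can be regenerated directly; a sketch with a fixed-step RK4 stepper of our own (step length and sampling rate are assumptions matching typical usage):

```python
import numpy as np

def lorenz_rhs(v, sigma=10.0, b=28.0, c=8.0 / 3.0):
    x, y, z = v
    return np.array([-sigma * (x - y), b * x - y - x * z, -c * z + x * y])

def sampled_lorenz(v0, h=0.001, n_steps=15000, sample_every=50):
    """Integrate with fixed-step RK4 and keep every sample_every-th point."""
    v = np.array(v0, dtype=float)
    samples = [v.copy()]
    for k in range(1, n_steps + 1):
        k1 = lorenz_rhs(v)
        k2 = lorenz_rhs(v + 0.5 * h * k1)
        k3 = lorenz_rhs(v + 0.5 * h * k2)
        k4 = lorenz_rhs(v + h * k3)
        v = v + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        if k % sample_every == 0:
            samples.append(v.copy())
    return np.array(samples)
```

With h = 0.001 and one sample every 50 steps, 15,000 steps yield 300 sampled points (plus the initial one) with a time separation of 0.05 units.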
prediction. On the other hand, the well-known numerical sensitivity of these equations must be taken into account; in particular, it is impossible to make good long term forecasts of their evolution. We will discuss two numerical experiments on the Lorenz equations. In the first one we integrated the above equations starting at x = y = z = 10 with a step length of 0.001 and sampled the resulting trajectories once every 50 steps. In this way we obtain a set of 300 points with a time separation of 0.05 units, of which we used the first 200 points for modeling purposes and the remaining 100 to test the model equations. In the recurrent case the main terms of the resulting system matched those of equation 4.1 for the values of \sigma, b, and c given above after rounding to two significant digits; the remaining coefficients had values less than 0.2. We tested the forecasting abilities of these equations by performing a 100 step integration from the last point of the training set, comparing the trajectories with those of the 100 point test set.

Figure 4: Lorenz x trajectory (continuous drawing; correct values are given by the end points of the vertical lines) and recurrent (circles) and feedforward (asterisks) approximations.

The evolution of the growth of the squared error E between the correct values of x, y, and z and their corresponding integration values, divided by subtrajectory length L, and that of R^2 is given in Table 2. These results show a good prediction for about 75% of the test trajectory that rapidly deteriorates afterward. A feedforward network based on a fifth-order Taylor approximation was also employed to model the training trajectories. Just as was done in the recurrent case, the error and R^2 evolution for the same increasing size intervals are also given in Table 2, which was obtained by clocking the network outputs back as the next inputs, starting again at the last point of the training set. The results are quite similar to those of the recurrent case, and the original x trajectory and the predicted one are shown in Figure 4. Our second Lorenz experiment was motivated by the fact that, if the system to be modeled is known to be derived from a Ricatti equation and good estimates of the trajectory derivatives are known, the system coefficients can simply be obtained by treating equation 1.2 as a linear equation with the a_i, b_ij, and c_ijk as unknowns. Thus, a possible modeling procedure could be to compute for each point of the training interval the values of a_i, b_ij, and c_ijk, and then get global values for them, for instance, by averaging the pointwise values over the full interval. This approach would certainly be quite expensive computationally: because there are essentially D^3/2 unknowns, the solution of that linear equation would
Table 3: Recurrent Modeling of the Lorenz Trajectories with Noise Added.

              Number of steps
        10       20       30       40       50       60
E/L    0.27     2.07     1.94     3.15     30.2     47.02
R^2    0.9943   0.9507   0.9525   0.9257   0.2668   -0.1299

have a cost on the order of (D^3/2)^3/3 = D^9/24 operations, although they would be performed only once.² In any case, if the derivatives cannot be adequately computed, the linear procedure is bound to fail. This is the case, for instance, if numerical differentiation is to be used and the training trajectories have noise of a certain amplitude \delta. Then, on top of the error inherent in the numerical approximation of the derivatives, there may also be a potentially much larger computational error, due to the use of x + \delta instead of the right value x. In fact, if one-sided difference quotients with step h are used, this computational error is of the order of 2\delta/h (Conte and de Boor 1986). In our case h = 0.05 and, therefore, a noise of amplitude 1 may produce error values of up to 40. Since the derivatives of the training Lorenz trajectories have average absolute values of about 50, this error may yield derivative values with an 80% error, making the linear equation approach unusable. However, our recurrent procedure still manages to provide acceptable modeling and prediction (although of course worse than in the noise free case), as shown in Table 3. We point out that, for dynamical systems such as the Lorenz with large Lyapunov exponents, prediction does not make much sense if used on a true Lorenz trajectory to which noise has been added: even the trajectories obtained from the right equations will diverge fast from the ones without noise if started on a noisy point. For this reason the values in Table 3 are derived from the same noise free test set used in our previous Lorenz examples. Predictions are adequate now for about the first 40 points, deteriorating afterward. Figure 5 depicts the evolution of the correct and predicted x trajectories.

4.3 Hénon Trajectories. A feedforward network is not capable of adequately modeling noisy Lorenz data, the reason being its overfitting of the noise in the training trajectory.
We have instead used these networks to study discrete mappings, as mentioned at the end of Section 3. As a final example we will briefly discuss the feedforward modeling of
²In our experiments optimization was typically achieved after 100 iterations; since for the Lorenz equations D = 3, the linear equation approach would still be twice as fast as gradient descent. However, for D = 4 gradient descent would be twice as fast, and ten times so for D = 5.
Figure 5: Lorenz x trajectory (circles) and recurrent approximation from noisy data (dots).

Table 4: Feedforward Modeling of the Hénon Trajectories.

              Number of steps
       5          10         15         20         25       30
E    4 x 10^-   3 x 10^-   2 x 10^-   3 x 10^-    0.09     6.05
R^2  1.0000     1.0000     1.0000     0.9997     0.9926   0.5895
the well-known Hénon mapping (see Lichtenberg and Lieberman 1983, chapter 7)

x_{n+1} = 1 + y_n - 1.4 x_n^2
y_{n+1} = 0.3 x_n
Starting at x_0 = y_0 = 0 we obtained 100 points that we used to adjust the output of an order 1 feedforward network using 1 as the time step. As done before, the resulting mapping was used to generate a 100 point trajectory starting again at x_0 = y_0 = 0. Table 4 presents the evolution of E (this time not divided by trajectory length) and R^2 with an increasing number of steps and shows that the actual and predicted evolutions coincide for about 25 points, rapidly disagreeing afterward. As also happens with the Lorenz equations, this cannot be otherwise, since very small deviations between actual and predicted values at a given point are amplified very fast to produce different pointwise evolutions. However, it must be noted that the long range behavior of the model trajectories is very close to that of the Hénon trajectories. This predicted evolution is depicted in Figure 6, which gives a 500 point section of 10,000 iterations of the approximating mapping, showing great
Figure 6: 500 point section of 10^4 iterations of the approximating mapping for the Hénon system.

similarity to the characteristic leaves of the Hénon attractor (see Lichtenberg and Lieberman 1983, p. 391).

5 Discussion
Time series forecasting is a problem faced in different branches of science and engineering. A common approach in basic science is to consider time series as solutions of differential equations derived from first principles, and it is normally assumed that noise has already been filtered out. In many applications, the complicated source of the series precludes this approach, and it has to be assumed that all the information available is just the collection of data, usually noisy, entering the series. However, even in this situation time series can be considered as solutions of some differential equations, though the task of identifying them can be too difficult. In practice, the objective is reduced to finding an approximation to such equations that could yield short time or limited accuracy forecasting. It is at this point that neural networks can play a role. They are expected to provide useful approximations to the ODEs governing the time series. A key step toward this goal is the design of neural network architectures capable of encompassing as general a class of ODEs as possible. The search for such architectures should take advantage of studies done in nonlinear dynamics toward universal formats for ODEs. In this work we have focused our attention on the so-called universal polynomial formats and particularly on the Ricatti format, to which higher degree
formats can be reduced. Our first result is that recurrent neural networks based on this format can learn time evolving trajectories with high accuracy, even those produced by other equation formats, and we also propose a rather effective numerical training method. On the other hand, sampling requirements usually force us to deal with discrete series that do not readily fit the continuous behavior underlying this recurrent approach. These series are better studied by means of discrete mappings, the problem now being to decide the concrete mapping format to be used. The easy integration of Ricatti ODEs by means of Taylor series expansions and its very natural interpretation in terms of a feedforward network (Dorronsoro and López 1991) has led us to introduce a discrete mapping alternative to continuous recurrent modeling, based on a purely feedforward, sigma-pi, architecture. Besides being very natural and well suited to these discrete series, and easily parallelizable in hardware, it also allows efficient and numerically stable learning procedures. Finally, we want to point out that in this study we have assumed that the problem series contain enough information to make their modeling possible, and therefore we have not considered in detail the deep questions of constructing extra coupled training trajectories when, for instance, just a single series is known. Our Ricatti model allows the straightforward procedure of building up products of the original series to generate complementary trajectories. Actually, this procedure is in one-to-one correspondence with the standard practice followed to reduce higher degree polynomial ODEs to lower order. We are currently working on the situation where some of the series relevant to the system under consideration are unknown, and also on the filtering issues that arise when studying noisy data.
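This reduction can be made concrete on a toy example of our own choosing: the cubic equation x' = -x^3 becomes a quadratic (Ricatti-format) system once the product series y = x^2 is appended, since then x' = -xy and y' = 2x x' = -2y^2. A sketch, checked against the exact solution x(t) = x_0 / sqrt(1 + 2 x_0^2 t):

```python
import numpy as np

def aug_rhs(v):
    # quadratic right-hand sides of the augmented (x, y = x^2) system
    x, y = v
    return np.array([-x * y, -2.0 * y * y])

def rk4(v0, h, n):
    """Fixed-step RK4 integration of the augmented system."""
    v = np.array(v0, dtype=float)
    for _ in range(n):
        k1 = aug_rhs(v)
        k2 = aug_rhs(v + 0.5 * h * k1)
        k3 = aug_rhs(v + 0.5 * h * k2)
        k4 = aug_rhs(v + h * k3)
        v = v + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return v
```

The augmented quadratic system reproduces the cubic dynamics, and the auxiliary series stays consistent with y = x^2 along the trajectory.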
These results will be published elsewhere, but, in any case, it is our opinion that the proposed recurrent and feedforward procedures represent a very natural, easy to implement, and quite promising first approach to the forecasting of coupled time series, a problem for which not too many methods are available.

Acknowledgments

This work has been partially supported by the Spanish Programa Nacional de Tecnologías de la Información y las Comunicaciones, Grant 313/90, and also by the Asociación para el Desarrollo de la Ingeniería del Conocimiento.

References

Bender, C. M., and Orszag, S. A. 1978. Advanced Mathematical Methods for Scientists and Engineers. McGraw-Hill, New York.
Box, G. E. P., and Jenkins, G. M. 1970. Time Series Analysis, Forecasting and Control. Holden-Day, San Francisco.
Cremers, J., and Hübler, A. 1987. Construction of differential equations from experimental data. Z. Naturforsch. 42a, 797-802.
Conte, S. D., and de Boor, C. 1986. Elementary Numerical Analysis. McGraw-Hill, New York.
Crutchfield, J. P., and McNamara, B. S. 1987. Equations of motion from a data series. Complex Systems 1, 417-452.
Dorronsoro, J. R., and López, V. 1991. Formal integrators and neural networks. In Neural Networks: Proceedings of the XI Sitges Conference, Lecture Notes in Physics. Springer-Verlag, Berlin.
Eisenhammer, T., Hübler, A., Packard, N., and Kelso, J. A. S. 1991. Modeling experimental time series with ordinary differential equations. Biol. Cybern. 65, 107-112.
Fairén, V., López, V., and Conde, V. 1988. Power series approximation to solutions of nonlinear systems of differential equations. Am. J. Phys. 56, 57-61.
Farmer, J. D., and Sidorowich, J. J. 1987. Predicting chaotic time series. Phys. Rev. Lett. 59, 845-848.
Gabor, D., Wilby, W. P. L., and Woodcock, R. 1960. A universal non-linear filter, predictor and simulator which optimizes itself by a learning process. Proc. IEE 108B, 422-438.
Hopfield, J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Kerner, E. D. 1981. Universal formats for nonlinear ordinary differential systems. J. Math. Phys. 22, 1366-1371.
Lichtenberg, A. J., and Lieberman, M. A. 1983. Regular and Stochastic Motion. Springer-Verlag, Berlin.
López, V., and Dorronsoro, J. R. 1991. Neural network learning of polynomial formats for coupled time series. Proc. of the International Conference on Artificial Neural Networks (ICANN-91), Vol. I, 201-206. North-Holland, Amsterdam.
Pearlmutter, B. 1989. Learning state space trajectories in recurrent neural networks. Neural Comp. 1, 263-269.
Pineda, F. 1987. Generalization of back propagation to recurrent and higher order neural networks. Proceedings of the IEEE Conference on Neural Information Processing Systems. IEEE Press, New York.
Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T. 1986. Numerical Recipes in FORTRAN. Cambridge Univ. Press, Cambridge.
Sato, M. 1990a. A real time learning algorithm for recurrent analog neural networks. Biol. Cybern. 62, 237-241.
Sato, M. 1990b. A learning algorithm to teach spatiotemporal patterns to recurrent neural networks. Biol. Cybern. 62, 259-263.
Toomarian, N., and Barhen, J. 1991. Adjoint functions and temporal learning algorithms in neural networks. In Advances in Neural Information Processing Systems 3, pp. 113-120. Morgan Kaufmann, San Mateo, CA.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1, 270-280.

Received 23 July 1992; accepted 28 January 1993.
Communicated by Douglas Miller
Attraction Radii in Binary Hopfield Nets are Hard to Compute

Patrik Floréen, Pekka Orponen
Department of Computer Science, University of Helsinki, SF-00024 Finland
We prove that it is an NP-hard problem to determine the attraction radius of a stable vector in a binary Hopfield memory network, and even that the attraction radius is hard to approximate. Under synchronous updating, the problems are already NP-hard for two-step attraction radii; direct (one-step) attraction radii can be computed in polynomial time.

A Hopfield memory network (Hopfield 1982) consists of n binary valued nodes, or "neurons." We index the nodes by {1, ..., n}, and choose {-1, +1} as their possible states (the values {0, 1} could be chosen equally well). Associated to each pair of nodes i, j is an interconnection weight w_ij. The interconnections are symmetric, so that w_ij = w_ji for each i, j; moreover, w_ii = 0 for each i. In addition, each node i has an internal threshold value t_i. We denote the matrix of interconnection weights by W = (w_ij), and the vector of threshold values by t = (t_1, t_2, ..., t_n). At any given moment, each node i in the network has a state x_i, which is either -1 or +1. The state at the next moment is determined as a function of the states of the other nodes as x_i := sgn(\sum_j w_ij x_j - t_i), where sgn is the signum function [sgn(x) = 1 for x >= 0 and sgn(x) = -1 for x < 0]. In the synchronous network model, this update step is performed simultaneously for all the nodes. Thus we may denote the global update rule for the network as x := sgn(Wx - t), where x = (x_1, x_2, ..., x_n) is the vector of states of the nodes. It is known (Goles et al. 1985) that, starting from any initial vector of states, a sequence of such updates will eventually converge either to a stable vector [i.e., a vector u such that sgn(Wu - t) = u], or to a cycle of length two [i.e., vectors u != v such that sgn(Wu - t) = v and sgn(Wv - t) = u]. We are interested only in the stable solutions. In an asynchronous network, the update step is performed for one node at a time in some (usually random) order.
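The synchronous dynamics just described are a few lines of code. The sketch below (helper names are ours; sgn(0) = +1, as in the text) iterates x := sgn(Wx - t) and reports the period of the cycle eventually reached:

```python
import numpy as np

def sync_update(W, t, x):
    # one synchronous step: x := sgn(W x - t), taking sgn as +1 at 0
    h = W @ x - t
    return np.where(h >= 0, 1, -1)

def run_sync(W, t, x, max_iters=1000):
    """Iterate until a state repeats; return (cycle length, final state).
    Cycle length 1 means a stable vector, 2 a two-cycle."""
    seen = {}
    step = 0
    while step < max_iters:
        key = tuple(int(s) for s in x)
        if key in seen:
            return step - seen[key], x
        seen[key] = step
        x = sync_update(W, t, x)
        step += 1
    return None, x
```

By the Goles et al. (1985) result quoted above, any symmetric weight matrix can only yield cycle lengths 1 or 2 under synchronous updating.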
Neural Computation 5, 812-821 (1993) © 1993 Massachusetts Institute of Technology

As shown in Hopfield (1982),
in a network with w_ii = 0 for each i, all asynchronous computations eventually converge to stable vectors. The attraction radius of a stable vector u is the largest Hamming distance from within which all other vectors are guaranteed eventually to converge to u. The k-step attraction radius is the largest distance from within which all other vectors converge to u in at most k update steps. As the main interest in Hopfield nets is in their error-correcting capacity with respect to the stable vectors, it would be of considerable importance to be able to determine the attraction radius of a given stable vector. However, we show that this problem is NP-hard for both the synchronous and the asynchronous networks, and thus not solvable by a polynomial time algorithm unless P = NP. Even more, we show that the attraction radius of a given stable vector (of length n) cannot even be approximated in polynomial time within a factor n^(1-ε) for any fixed ε > 0, unless P = NP. [Similar results for other problems arising in the context of analyzing Hopfield nets have been obtained in Floréen and Orponen (1989) and Godbeer et al. (1988). For general introductions to computational complexity issues in neural networks, see Orponen (1992), Parberry (1990), and Wiedermann (1990).] We start by examining the easier case: the synchronous network. As will be seen, the boundary between tractability and intractability is here located between computing direct (one-step) and two-step attraction radii. We first observe that the former can be computed in polynomial time.

Theorem 1. The problem "Given a synchronous Hopfield network, a stable vector u, and a distance k, is the direct attraction radius of u equal to k?" is polynomially solvable.
Proof. The following polynomial time procedure determines the direct attraction radius of u:

1. radius := n;
2. for each node i:
   a. compute the values w_ij u_j for j = 1, ..., n and order them as a_1 ≥ a_2 ≥ ... ≥ a_n;
   b. sum := Σ_j a_j - t_i   % this is the total input to node i
   c. if u_i = 1 then
      i. k := 1;
      ii. repeat sum := sum - 2a_k; k := k + 1 until sum < 0 or a_k ≤ 0 or k = n + 1;
      iii. if sum < 0 then radius := min{radius, k - 2};
   d. if u_i = -1 then
      i. k := n;
      ii. repeat sum := sum - 2a_k; k := k - 1 until sum ≥ 0 or a_k ≥ 0 or k = 0;
      iii. if sum ≥ 0 then radius := min{radius, n - k - 1};
3. return radius.
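The procedure can be transcribed directly; in the sketch below (0-based indexing and the loop-guard form are ours, but the flip-counting logic follows the steps above), each node reports how many of its strongest input contributions must be flipped before its state changes:

```python
def direct_attraction_radius(W, t, u):
    """Direct (one-step) attraction radius of a stable vector u,
    using the paper's convention sgn(x) = 1 for x >= 0."""
    n = len(u)
    radius = n
    for i in range(n):
        a = sorted((W[i][j] * u[j] for j in range(n)), reverse=True)
        total = sum(a) - t[i]              # total input to node i in state u
        if u[i] == 1:
            k = 0                          # flip the largest positive a's first
            while k < n and a[k] > 0 and total >= 0:
                total -= 2 * a[k]
                k += 1
            if total < 0:
                radius = min(radius, k - 1)
        else:
            k = n - 1                      # flip the most negative a's first
            while k >= 0 and a[k] < 0 and total < 0:
                total -= 2 * a[k]
                k -= 1
            if total >= 0:
                radius = min(radius, n - k - 2)
    return radius

# 3-node "majority" network: every distance-1 vector maps back to u = (1,1,1)
# in one step, but some distance-2 vector does not, so the direct radius is 1.
W = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
t = [0, 0, 0]
r = direct_attraction_radius(W, t, [1, 1, 1])   # -> 1
```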
Intuitively, we check for each node how many of its inputs must be altered to change its state in the update. The minimum of these numbers is the distance to the nearest vector whose update results in something other than u. If this distance is k, the radius of direct attraction is k - 1. □

Next we consider the problem of computing the asymptotic attraction radius. Note that this problem is in NP if the weights in the network are polynomially bounded in n. A nondeterministic algorithm for the problem works as follows: given a vector u and a distance k, guess a vector that is within distance k from u and does not converge to u, witnessing that the attraction radius of u is less than k. When the weights are polynomially bounded, any vector converges to either a stable vector or a cycle of length two in polynomial time (Goles et al. 1985).
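For tiny networks the asymptotic radius can still be found by exhaustively testing every candidate witness, which makes the nondeterministic argument concrete and shows why this brute force is exponential in n. A hypothetical sketch (our own illustration):

```python
from itertools import combinations

def sgn(x):
    return 1 if x >= 0 else -1

def step(W, t, x):
    n = len(x)
    return tuple(sgn(sum(W[i][j] * x[j] for j in range(n)) - t[i])
                 for i in range(n))

def converges_to(W, t, x, u, max_steps=100):
    for _ in range(max_steps):
        if x == u:
            return True
        nxt = step(W, t, x)
        if nxt == x:
            return False           # reached a different stable vector
        x = nxt
    return False                    # trapped in a two-cycle (or too slow)

def attraction_radius(W, t, u):
    """Largest d such that every vector within Hamming distance d of u
    converges to u; exhaustive over all flip sets, hence exponential in n."""
    n = len(u)
    for d in range(1, n + 1):
        for flips in combinations(range(n), d):
            x = list(u)
            for i in flips:
                x[i] = -x[i]
            if not converges_to(W, t, tuple(x), u):
                return d - 1
    return n

# 3-node "majority" network: the direct radius is only 1, but every
# distance-2 vector still reaches u after two steps, while the all-(-1)
# vector is a different stable vector, so the asymptotic radius is 2.
W = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
r = attraction_radius(W, [0, 0, 0], (1, 1, 1))   # -> 2
```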
Theorem 2. The problem "Given a synchronous Hopfield network, a stable vector u, and a distance k; is the attraction radius of u less than k?" is NP-hard.

Proof. We prove that the problem is NP-hard by a reduction from the NP-complete satisfiability problem SAT (Garey and Johnson 1979): "Given a conjunctive normal form (CNF) formula F with k variables and m clauses, is there a satisfying truth assignment for F?"¹ In fact, we need a special version of SAT: we require that no clause contains both a variable and its negation, and we require that the number of clauses is greater than the number of variables. It is easy to see that these requirements do not affect the NP-completeness, since clauses with both a variable and its negation can be excluded, and if the number of clauses is too small, we can simply repeat one of the clauses the required number of times. Let x̄ = (x̄_1, ..., x̄_k) be some truth assignment not satisfying F. Such an assignment is easy to find: take, for instance, the first clause and choose values for the variables against their appearance in the clause; that is, if variable x_i is in the clause, choose false for it, and if x_i appears negated in the clause, choose true for it; otherwise the value of the variable does not affect the value of the clause, so choose, for instance, the value false for it. Transform formula F to an equivalent formula F' by adding k times one of the false clauses. This ensures that at least k + 1 of the clauses evaluate to false under x̄. In the following, m refers to the number of clauses in F'.

¹A CNF formula is a conjunction of clauses c_1, c_2, ..., c_m, where a clause is a disjunction of boolean variables x_1, x_2, ..., x_k and their negations, for example, x_1 ∨ ¬x_3 ∨ ¬x_4; and a satisfying truth assignment is a choice of values (true or false) for the variables so that the formula gets value true.
Now we construct a Hopfield network in such a way that there is a stable vector ū corresponding to the truth assignment x̄, and unless there is a satisfying truth assignment, all input vectors differing from ū in at most k elements converge to ū. On the other hand, if there is a satisfying truth assignment x̂, the vector corresponding to x̂ differs from the stable vector ū in at most k elements and does not converge to ū; hence the attraction radius of ū is less than k. In the construction, truth value true is represented by node state +1, and truth value false is represented by node state -1. In the following, we make no distinction between the truth values and the corresponding node states. The network has nodes corresponding to the variables and the clauses, and 2k + 2 additional nodes, in total 3k + m + 2 nodes. We denote a state vector of the network as (x, c, r, s), where subvector x = (x_1, ..., x_k) corresponds to the variables, subvector c = (c_1, ..., c_m) corresponds to the clauses, and subvectors r and s, each of length P = k + 1, correspond to the additional nodes. Let c_x be the vector of truth values for the clauses resulting from assignment x. In particular, denote c̄ = c_x̄. The stable vector in our construction will be ū = (x̄, c̄, -1, -1), where -1 stands for the all-(-1) subvector of length P. Each r-node represents the conjunction of the clauses represented by the c-nodes; the r-nodes are replicated to guarantee that not all of them can be +1 for vectors within Hamming distance k from the stable vector ū, which has r = -1. The s-nodes work in such a way that as soon as all of them get state +1, their states cannot change any more. Let α = P² + 1 = k² + 2k + 2. The Hopfield network is constructed in the following way (see Fig. 1):
- The threshold value of each node x_i is -(αm + 1)x̄_i;
- The threshold value of each node c_j is -α(k_j - 1), where k_j is the number of literals (i.e., variables and negations of variables) in the clause c_j;
- The threshold value of each node r_i is P(m - 1);
- The threshold value of each node s_i is 0;
- There is an edge between each x_i and each c_j with weight α if literal x_i is in clause c_j, and with weight -α if literal ¬x_i is in clause c_j;
- There is an edge between each c_j and each r_i with weight P;
- There is an edge between each r_i and the corresponding s_i with weight k;
- The s-nodes constitute a fully connected subnetwork: there is an edge between each s_i and each s_j (where j ≠ i) with weight 1; and
- All other edges have weight 0, that is, they are excluded.
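The construction can be transcribed mechanically. The sketch below (encoding choices such as DIMACS-style signed literals are ours) builds W and t for a toy formula and checks that ū = (x̄, c̄, -1, -1) is indeed fixed under x := sgn(Wx - t):

```python
def build_network(k, clauses, xbar):
    """W and t of the reduction, for a CNF over k variables (clauses are
    DIMACS-style, e.g. [1, -3] means x1 v not-x3) and a non-satisfying
    assignment xbar in {-1, +1}^k. Node order: x-, c-, r-, s-nodes."""
    m, P = len(clauses), k + 1
    alpha = P * P + 1
    N = k + m + 2 * P
    W = [[0] * N for _ in range(N)]
    t = [0] * N
    X, C, R, S = 0, k, k + m, k + m + P        # index offsets per layer
    for i in range(k):
        t[X + i] = -(alpha * m + 1) * xbar[i]
    for j, cl in enumerate(clauses):
        t[C + j] = -alpha * (len(cl) - 1)
        for lit in cl:                          # x_i -- c_j edges
            i, w = abs(lit) - 1, alpha if lit > 0 else -alpha
            W[X + i][C + j] = W[C + j][X + i] = w
    for i in range(P):
        t[R + i] = P * (m - 1)
        for j in range(m):                      # c_j -- r_i edges
            W[C + j][R + i] = W[R + i][C + j] = P
        W[R + i][S + i] = W[S + i][R + i] = k   # r_i -- s_i edge
        for j in range(P):                      # fully connected s-layer
            if i != j:
                W[S + i][S + j] = 1
    return W, t

def clause_value(cl, x):
    return 1 if any((lit > 0) == (x[abs(lit) - 1] == 1) for lit in cl) else -1

# F' = (x1) repeated twice (k = 1 extra copy of the false clause), xbar = false.
k, clauses, xbar = 1, [[1], [1]], [-1]
W, t = build_network(k, clauses, xbar)
cbar = [clause_value(cl, xbar) for cl in clauses]
u = xbar + cbar + [-1] * (k + 1) + [-1] * (k + 1)
nxt = [1 if sum(W[i][j] * u[j] for j in range(len(u))) - t[i] >= 0 else -1
       for i in range(len(u))]
stable = (nxt == u)
```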
Figure 1: The structure of the Hopfield network in Theorem 2.

It is easy to check that ū = (x̄, c̄, -1, -1) is a stable vector in this network. This construction is polynomial in the size of the input formula, and we can now proceed to proving that this is the desired reduction. We prove first that if there is no satisfying truth assignment, all vectors at distance at most k from ū converge to ū, and after that we prove that if there is a satisfying truth assignment x̂, then the vector (x̂, c̄, -1, -1) does not converge to ū, and hence the attraction radius of ū is strictly less than the Hamming distance between x̄ and x̂, which is at most k.

1. Assume there is no satisfying truth assignment. In this case, take an arbitrary input vector (x, c, r, s) with Hamming distance at most k from ū. At the first update step, the states of the x-nodes become x̄, since the input to node x_i is between -αm and αm, and hence x_i gets state sgn[±αm + (αm + 1)x̄_i] = x̄_i. We use here an abbreviation of type sgn(a ± b) for sgn(x), where a - b ≤ x ≤ a + b. As the threshold values in this way force the x-nodes to get states x̄, the states of the x-nodes do not change any more during the computation. The states of the c-nodes get values c_x. As there is no satisfying truth assignment, at least one of the c_x-values is -1. Recall that at least one c_j has initial state -1 even if k c-nodes have their states differing from c̄. Hence the input to each r-node from the c-nodes is at most P(m - 2), and as there is only one connection from an s-node to each r-node, the s-nodes contribute at most k to the input. Thus the r-nodes get state -1, and the only situation
in which the r-nodes can get states +1 is when all the c-nodes have state +1. At the second update step the states of the c-nodes become c̄. This can be seen as follows. As c̄ represents the truth values resulting from x̄, the input from the x-nodes is -αk_j if c̄_j = -1, and at least -α(k_j - 2) if c̄_j = 1. The input from the r-nodes is between -P² and P², so if c̄_j = -1, then c_j gets state sgn[-αk_j ± P² + α(k_j - 1)] = -1, and if c̄_j = 1, then c_j gets state sgn[-α(k_j - 2) ± P² + α(k_j - 1)] = 1. As the x-nodes do not change any more, also the c-nodes do not change any more during the computation. As there is some c-node with state -1, the states of the r-nodes do not change: they are all still -1. As the Hamming distance between (x, c, r, s) and ū is at most k, at least one r-node and at least one s-node have initial states -1. Thus at the first update step, the absolute value of the input from the other s-nodes to each s-node is at most k - 2, whereas the input from the corresponding r-node is the state of the r-node times k. This results in the s-nodes having the same states after the first update step as the r-nodes had before the first update step. Consequently, there is at least one s-node with state -1 after the first update step. The first update step results in all r-nodes having state -1. Consequently, all s-nodes get states -1 in the second update step. To sum up: starting with (x, c, r, s), the first update step results in (x̄, c_x, -1, r), and the second update step results in ū.
2. Assume that there is a satisfying assignment x̂. We show that the input vector (x̂, c̄, -1, -1) does not converge to ū, which implies that the attraction radius must be less than k. In the first update step, the x-nodes become x̄, each c_j gets state sgn[-α(k_j - 2) ± P² + α(k_j - 1)] = 1, and r and s stay -1. In the second update step, the c_j-nodes become c̄, but each r_i gets state sgn[Pm ± k - P(m - 1)] = 1, while s still stays -1. In the third update step, the r-nodes become -1, but each s_i gets state sgn(k - k) = 1. Now s stays as it is, since from now on the total input to each s_i is -k + k = 0. The computation has converged to (x̄, c̄, -1, +1) ≠ ū.
The proof is now completed. □
From the construction in the proof, we see that just determining the two-step attraction radius is NP-hard. Computing the direct attraction radius is easy while computing the two-step attraction radius is hard, because for the direct radius it is enough to check the change of one element at a time while for the two-step radius we have to check the changes of the changes. Also approximating the attraction radius is hard. We say that an approximation algorithm to a minimization problem approximates the
problem within a factor K, if for all sufficiently large problem instances I, the result of the algorithm is at most K·min(I), where min(I) is the optimal result for instance I. If a CNF formula is satisfiable, it can in general be satisfied by many different truth assignments. We use the name MIN ONES for the problem of finding the minimum number of true variables in a satisfying truth assignment. [The analogous maximization problem MAX ONES has been considered in Panconesi and Ranjan (1990).] We see from the construction of the network in Theorem 2 that the attraction radius is one less than the minimum Hamming distance between vector x̄ and a satisfying vector x̂. Now, construct from a given instance of SAT a formula in the way described in Theorem 2. For each x̄_i = 1, change all literals x_i to ¬x_i and all literals ¬x_i to x_i. Now setting all variables to false yields a nonsatisfying truth assignment for the formula, and (-1, c_{-1}, -1, -1) is the stable vector we consider. Thus, the problem of computing the attraction radius is equivalent to the problem MIN ONES of finding the minimum number of true variables in a satisfying truth assignment to a CNF formula. It is easy to show that there is no polynomial time algorithm approximating MIN ONES within a factor K for any fixed K > 1, unless P = NP. Given a CNF formula F with k variables, denote n = ⌊Kk⌋ and add n + 1 new variables z, z_1, z_2, ..., z_n. Construct the formula
G = (F ∨ z) ∧ ⋀_{i=1}^{n} [(z ∨ ¬z_i) ∧ (¬z ∨ z_i)]
Note that G can be made into a CNF formula by distributing the z in the first conjunct over the clauses of F. Now the number of true variables needed to make G true is either at most k (if F is satisfiable) or n + 1 > Kk (setting z_1, z_2, ..., z_n, and z to true). Consequently, an algorithm approximating MIN ONES within a factor K would in fact decide the satisfiability of formula F. We shall introduce here also a stronger construction, which was suggested to us by Viggo Kann (Kann 1992). For this construction, we need the MINIMUM INDEPENDENT DOMINATING SET minimization problem, which asks for the size of a minimum independent dominating set in an undirected graph. Let (V, E) be a graph, where V is the set of nodes and E ⊆ V × V is the set of edges. Then a subset S of V is an independent dominating set if there is no edge between any two nodes in S and every node in the graph is either in S or is a neighbor of a node in S. Magnús Halldórsson has shown that MINIMUM INDEPENDENT DOMINATING SET cannot be approximated in polynomial time within a factor n^(1-ε) for any fixed ε > 0, unless P = NP (Halldórsson 1993). Here n is the number of nodes in the graph.
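On toy formulas the gap produced by this gadget can be verified by brute force. In the sketch below (the variable numbering, with z as variable k + 1 and z_i as k + 1 + i, is our own convention) a satisfiable F admits a satisfying assignment with at most k true variables, while an unsatisfiable F forces n + 1 of them:

```python
from itertools import product

def evaluate(clauses, assign):
    # clauses: lists of signed variable indices; assign: dict index -> bool.
    return all(any(assign[abs(l)] == (l > 0) for l in cl) for cl in clauses)

def min_ones(clauses, variables):
    """Minimum number of true variables over satisfying assignments
    (brute force, exponential); None if the formula is unsatisfiable."""
    best = None
    for bits in product([False, True], repeat=len(variables)):
        a = dict(zip(variables, bits))
        if evaluate(clauses, a):
            ones = sum(bits)
            best = ones if best is None else min(best, ones)
    return best

def gadget(clauses, k, K):
    """G = (F v z) ^ AND_i [(z v -z_i) ^ (-z v z_i)] with n = floor(K*k)
    fresh variables; z is distributed over the clauses of F."""
    n = int(K * k)
    z = k + 1
    g = [cl + [z] for cl in clauses]
    for i in range(1, n + 1):
        g += [[z, -(z + i)], [-z, z + i]]
    return g, list(range(1, k + 2 + n))

# Satisfiable F over k = 2 variables: (x1 v x2). Unsatisfiable F: (x1) ^ (-x1).
sat_g, sat_vars = gadget([[1, 2]], 2, 2)
unsat_g, unsat_vars = gadget([[1], [-1]], 1, 2)
lo = min_ones(sat_g, sat_vars)       # F satisfiable: z and all z_i stay false
hi = min_ones(unsat_g, unsat_vars)   # F unsatisfiable: z, z_1, ..., z_n forced true
```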
Lemma 1. There is no polynomial time algorithm approximating MIN ONES within a factor n^(1-ε) for any fixed ε, where 0 < ε ≤ 1, unless P = NP. Here n is the number of variables in the CNF formula.
Proof. We prove that a polynomial time algorithm approximating MIN ONES within a factor K would give a polynomial time algorithm approximating MINIMUM INDEPENDENT DOMINATING SET within a factor K. Consequently, MIN ONES is at least as hard to approximate as MINIMUM INDEPENDENT DOMINATING SET, and the claim follows from Halldórsson's result. Let (V, E) be a graph with n nodes. Create one variable s_i for each node s_i ∈ V. Note that for simplicity we use the same notation for both the node and the variable. Now we transform the property that a node is in an independent dominating set to the property that the corresponding variable is true. Denote the set of neighbors of node s_i by E_i = { s_j | {s_i, s_j} ∈ E }. Construct the CNF formula
G = ⋀_{s_i ∈ V} (s_i ∨ ⋁_{s_j ∈ E_i} s_j) ∧ ⋀_{{s_i, s_j} ∈ E} (¬s_i ∨ ¬s_j)
Every satisfying truth assignment to this formula corresponds to an independent dominating set in the graph, and the size of this set is equal to the number of true variables. This completes the proof. □
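The correspondence is easy to check exhaustively on a small graph: the satisfying assignments of G are exactly the independent dominating sets, and the number of true variables is the set size. A brute-force sketch (our own illustration):

```python
from itertools import product

def satisfies(n, edges, s):
    """Truth value of G = AND_i (s_i v OR_{j ~ i} s_j) ^
    AND_{{i,j} in E} (-s_i v -s_j) for a boolean assignment s."""
    nb = {i: [] for i in range(n)}
    for a, b in edges:
        nb[a].append(b)
        nb[b].append(a)
    dominating = all(s[i] or any(s[j] for j in nb[i]) for i in range(n))
    independent = all(not (s[a] and s[b]) for a, b in edges)
    return dominating and independent

def min_ones(n, edges):
    """Minimum number of true variables over satisfying assignments,
    i.e., the size of a minimum independent dominating set."""
    best = None
    for s in product([False, True], repeat=n):
        if satisfies(n, edges, s):
            ones = sum(s)
            best = ones if best is None else min(best, ones)
    return best

# Path on 4 nodes 0-1-2-3: a minimum independent dominating set is {0, 2}.
size = min_ones(4, [(0, 1), (1, 2), (2, 3)])
```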
Corollary 1. There is no polynomial time algorithm approximating the attraction radii in a synchronous Hopfield network (with n nodes) within a factor n^(1-ε) for any fixed ε, where 0 < ε ≤ 1, unless P = NP.

Proof. We have already seen that the problem of computing the attraction radius is equivalent to the problem MIN ONES of finding the minimum number of true variables. The claim follows immediately from Lemma 1. □

The results above for synchronous Hopfield memories can be extended to asynchronous Hopfield memories. In the asynchronous case, the results are valid for the asymptotic attraction radius only; the k-step attraction radius is not interesting. We sketch below how the proof of Theorem 2 must be modified in order to apply to asynchronous Hopfield memories. The nonapproximability result then follows in the same manner as in Corollary 1.
Theorem 3. The problem "Given an asynchronous Hopfield network, a stable vector u, and a distance k; is the attraction radius of u less than k?" is NP-hard.

Proof (sketch). The problem in applying the proof of Theorem 2 to the asynchronous case lies in the free update order. To avoid this problem, we add for each clause c_j a subnetwork checking that c_j has the correct value for the current variables x, that is, the subnetwork computes c_j ≡ (x_{i_1} ∨ x_{i_2} ∨ ... ∨ x_{i_{k_j}}).² The results of the subnetworks are used by the

²Note that some variables may appear negated in the disjunction. For simplicity of notation, we assume that c_j is of the expressed form.
r-nodes: node r_j gets value 1 if and only if c_j = 1 and, additionally, the equivalence is satisfied. In order to avoid cheating the equivalence test by choosing suitable initial values, the subnetworks are replicated so that there are k = P - 1 such subnetworks for each c_j. Node r_j gets value 1 if and only if all the subnetworks connected to it yield 1. Additionally, to avoid cheating by manipulating a small set of the x variables, the result of the equivalence test must be false for the stable state ū. Hence, we extend the equivalence test to

[c_j ≡ (x_{i_1} ∨ x_{i_2} ∨ ... ∨ x_{i_{k_j}})] ∧ (x ≠ x̄)
The subnetworks are put between the layer of c-nodes and the layer of r-nodes: each subnetwork has connections from each x-node and from the corresponding c-node, and the result of the subnetwork is used by the corresponding r-node. Thus node r_j is connected to node c_j and to all the P - 1 subnetworks connected to c_j; each connection has weight P (see Fig. 2). Each subnetwork has 3k + 5 nodes and depth 4; the weights must again be chosen so that nodes to the right cannot influence the result (nodes to the left solely determine the outcome of the update). This
Figure 2: The network part with direct connections to node c_1 in the modified network in Theorem 3 (cf. Fig. 1). Note that there are in fact four connections, with different weights, between each x-node and each subnetwork (denoted by a square).
means also that the weight α must be increased to α' = k_max(P² + 2P + 3) + P + 1, where k_max is the maximum number of literals in a clause in the formula. Now we can proceed roughly as in Theorem 2. □

Acknowledgments

P. O. would like to thank Max Garzon for inspiring discussions on the topics of this work. The work of P. F. was supported by the Academy of Finland.

References

Floréen, P., and Orponen, P. 1989. On the computational complexity of analyzing Hopfield nets. Complex Syst. 3, 577-587.
Garey, M. R., and Johnson, D. S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, New York.
Godbeer, G. H., Lipscomb, J., and Luby, M. 1988. On the computational complexity of finding stable state vectors in connectionist models (Hopfield nets). Tech. Rep. 208/88, Dept. of Computer Science, Univ. of Toronto.
Goles Ch., E., Fogelman-Soulié, F., and Pellegrin, D. 1985. Decreasing energy functions as a tool for studying threshold networks. Discrete Applied Math. 12, 261-277.
Halldórsson, M. M. 1993. Approximating the minimum maximal independence number. JAIST Research Report ISRR-93-0001F, School of Information Science, Japan Advanced Institute of Science and Technology, Japan.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U.S.A. 79, 2554-2558.
Kann, V. 1992. Personal communication.
Orponen, P. 1992. Neural networks and complexity theory. In Proceedings of the 17th International Symposium on Mathematical Foundations of Computer Science, pp. 50-61. Lecture Notes in Computer Science 629. Springer-Verlag, Berlin.
Panconesi, A., and Ranjan, D. 1990. Quantifiers and approximation. In Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, pp. 446-456. ACM, New York.
Parberry, I. 1990. A primer on the complexity theory of neural networks. In Formal Techniques in Artificial Intelligence: A Sourcebook, R. B. Banerji, ed., pp. 217-268. Elsevier, Amsterdam.
Wiedermann, J. 1990. Complexity issues in discrete neurocomputing. In Proceedings on Aspects and Prospects of Theoretical Computer Science, pp. 480-491. Lecture Notes in Computer Science 464. Springer-Verlag, Berlin.
Received 4 May 1992; accepted 2 February 1993.
ARTICLE
Communicated by John Rinzel
Analysis of Neuron Models with Dynamically Regulated Conductances
L. F. Abbott and Gwendal LeMasson
Departments of Physics and Biology, and Center for Complex Systems, Brandeis University, Waltham, MA 02254 USA

We analyze neuron models in which the maximal conductances of membrane currents are slowly varying dynamic variables regulated by the intracellular calcium concentration. These models allow us to study possible activity-dependent effects arising from processes that maintain and modify membrane channels in real neurons. Regulated model neurons maintain a constant average level of activity over a wide range of conditions by appropriately adjusting their conductances. The intracellular calcium concentration acts as a feedback element linking maximal conductances to electrical activity. The resulting plasticity of intrinsic characteristics has important implications for network behavior. We first study a simple two-conductance model, then introduce techniques that allow us to analyze dynamic regulation with an arbitrary number of conductances, and finally illustrate this method by studying a seven-conductance model. We conclude with an analysis of spontaneous differentiation of identical model neurons in a two-cell network.

1 Introduction

Mathematical models based on the Hodgkin-Huxley approach (Hodgkin and Huxley 1952) describe active neuronal conductances quite accurately over time scales ranging from milliseconds to several seconds. Model neurons constructed from these descriptions (see for example Koch and Segev 1989) exhibit a wide variety of behaviors similar to those found in real neurons, including tonic spiking, plateau potentials, and periodic bursting. However, neuronal conductances can change over longer time scales through additional processes not modeled by the Hodgkin-Huxley equations.
Neural Computation 5, 823-842 (1993) © 1993 Massachusetts Institute of Technology

These include modification of channel structure and/or density through biochemical pathways involving protein phosphorylation (Kaczmarek 1987; Chad and Eckert 1986; Kaczmarek and Levitan 1987) and gene expression (Morgan and Curran 1991; Sheng and Greenberg 1990; Smeyne et al. 1992). These processes can be activity dependent. For
example, when rat myenteric neurons are chronically depolarized they show decreased calcium currents (Franklin et al. 1992). Electrical activity can induce the expression of immediate-early genes like fos over a period of about 15 min (Morgan and Curran 1991; Sheng and Greenberg 1990; Smeyne et al. 1992), and expression of the immediate-early gene ras has been associated with an increased potassium conductance (Hemmick et al. 1992). From these studies it is clear that the biochemical processes that affect membrane conductances act on many different time scales. Relatively fast effects, such as the voltage and calcium dependence of channel conductances, are included in the usual Hodgkin-Huxley descriptions. However, activity-dependent modifications of membrane currents due to slower and less direct processes are not. Unfortunately, there is not enough information available at the present time to build a detailed model of the biochemical processes producing slow modification or, as we will call it, regulation of membrane conductances. However, we feel that it is not too early to try to assess what impact such a process might have on the behavior of neurons and neural networks. To do this we have constructed a simple phenomenological model with slowly varying, dynamically regulated conductances and studied its behavior using computer simulation (LeMasson et al. 1992). The model reveals several interesting features:

- Starting from a wide variety of initial conductances, the model neurons can automatically develop the currents needed to produce a particular pattern of electrical activity.
- Slow regulatory processes can significantly enhance the stability of the model neuron to environmental perturbations such as changes in the extracellular ion concentrations.
- The intrinsic properties of model neurons are modified by sustained external currents or synaptic inputs.
- In simple networks, model neurons can spontaneously differentiate, developing different intrinsic properties and playing different roles in the network.
These features have obvious implications for the development and plasticity of neuronal circuits. Our previous work (LeMasson et al. 1992) relied solely on computer simulation involving a fairly complex neuronal model. In this paper we devise a general procedure for analyzing the process of dynamic regulation. We will examine the properties listed above in detail both for a simple neuron model and for the more complex model considered previously.
2 A Model of Dynamic Regulation
We consider a single compartment, conductance-based neuron model with the membrane potential V determined by the basic equation

C dV/dt = -Σ_i I_i    (2.1)
C is the membrane capacitance and the I_i are the membrane currents, which are written in the form (Hodgkin and Huxley 1952; see Koch and Segev 1989)

I_i = g_i m_i^(p_i) h_i^(q_i) (V - E_i)    (2.2)
where E_i is the equilibrium potential corresponding to the particular ion producing the ith current, p_i and q_i are integers, and g_i is the maximal conductance for the current i. The dynamic variables m_i and h_i are determined by first-order differential equations, linear in m_i and h_i but with nonlinear voltage-dependent coefficients,

dm_i/dt = α_{m_i}(V)(1 - m_i) - β_{m_i}(V) m_i    (2.3)

and

dh_i/dt = α_{h_i}(V)(1 - h_i) - β_{h_i}(V) h_i    (2.4)
These equations describe the voltage-dependent characteristics of the conductance. Calcium-dependent properties can be included by allowing α and β to depend on the intracellular calcium concentration as well as on the voltage. In conventional Hodgkin-Huxley type models, the maximal conductances g_i are fixed constants. However, these are likely candidates for the slow modulation that we refer to as dynamic regulation. This is because the maximal conductance of a given current is the product of the conductance of an individual membrane channel times the density of channels in the membrane. Any slow process that alters the conductance properties of the channel or adds or removes channels from the membrane will affect g_i. These slow, regulatory processes can be included in the model by making the maximal conductances dynamic variables instead of fixed parameters (LeMasson et al. 1992). Regulatory mechanisms could also, in principle, modify the kinetics of channel activation and inactivation, but we will not consider this possibility here. To construct a model with dynamic regulation, we need to describe a mechanism by which the activity of a neuron can affect the maximal conductances of its membrane currents. Numerous possibilities exist, including modified rates of channel gene expression, structural modifications of the channels either before or after insertion into the membrane,
and changes in the rates of insertion or degradation of channels. These (and many other) processes often depend on the intracellular calcium concentration (Kennedy 1989; Rasmussen and Barrett 1984; Sheng and Greenberg 1990; Murphy et al. 1991). For example, activity-dependent expression of immediate-early genes has been linked to an elevation in calcium levels due to influx through voltage-dependent calcium channels (Murphy et al. 1991), and calcium is implicated in many other examples of slow, activity-dependent modulation (Kennedy 1989; Rasmussen and Barrett 1984; Sheng and Greenberg 1990). In addition, the intracellular calcium concentration is highly correlated with the electrical activity of the neuron (Ross 1989; LeMasson et al. 1992). For these reasons, we use the intracellular calcium concentration as the feedback element linking activity to maximal conductance strengths (LeMasson et al. 1992). Since the maximal conductances g_i depend on both the number and properties of the membrane channels, their values will be affected by the processes outlined above. If these processes are regulated by calcium, the values of the maximal conductances will also depend on the intracellular calcium concentration. We will assume that the kinetics is first-order and that both the equilibrium values of the maximal conductances and the rate at which they approach the equilibrium value may be calcium dependent. As a result, the behavior of the maximal conductances g_i is described by the equations

τ_i([Ca]) dg_i/dt = F_i([Ca]) - g_i    (2.5)
where [Ca] is the intracellular calcium concentration. At fixed intracellular calcium concentration [Ca], the maximal conductance g_i will approach the asymptotic value F_i([Ca]) over a time of order τ_i([Ca]). If the calcium concentration changes, the maximal conductances will also change their values. This regulation is a slow process occurring over a time ranging from several minutes to hours. This time scale distinguishes the calcium regulation of equation 2.5 from the more familiar and rapid calcium dependence of currents like the calcium-dependent potassium current. The full neuron model with dynamic regulation of conductances is described by equations 2.1-2.5 and an equation for the intracellular calcium concentration [Ca]. For the model to work, it is crucial that one of the membrane currents I_i = I_Ca be a voltage-dependent calcium current, because this is what links the intracellular calcium concentration to activity. We will assume that entry through voltage-dependent calcium channels is the only source of intracellular calcium and will not consider release from intracellular stores. Calcium is removed by processes that result in an exponential decay of [Ca]. Thus, [Ca] is described by the equation

d[Ca]/dt = -A I_Ca - k [Ca]    (2.6)
The constant A depends on the ratio of surface area to volume for the cell. We typically use a value between 1/(100 msec) and 1/sec for the constant k controlling the rate of calcium buffering. To complete the model we must specify the functions τ_i([Ca]) and F_i([Ca]) appearing in equation 2.5. As in our previous work (LeMasson et al. 1992) we are guided in the choice of these functions by considerations of simplicity and stability. We are primarily interested in the equilibrium behavior of the regulated model. Because of this, we can simplify equation 2.5 by setting all the time constants equal and making them calcium-independent,

τ_i([Ca]) = τ    (2.7)
where τ is a constant independent of [Ca]. This simplification has no effect on the equilibrium behavior of the model. In our simulations, we have taken the time constant τ to vary from 1 to 50 sec. We expect real regulatory processes to be considerably slower than this. However, the only condition on the model is that τ be much longer than the time scales associated with the membrane currents, so we have accelerated the regulatory process to speed up our simulations. The functions F_i determine how the asymptotic values of the maximal conductances depend on the calcium concentration. We assume that the regulation mechanism can vary the maximal conductances g_i over a range 0 < g_i < G_i, where G_i is the largest value that g_i can possibly take. In addition, a given maximal conductance can either increase or decrease as a function of the intracellular calcium concentration. These considerations lead us to consider just two possible forms (up to an overall constant) for the F_i, either a rising or a falling sigmoidal function,

F_i([Ca]) = G_i σ(±(C_T − [Ca])/Δ)    (2.8)
where G_i, C_T, and Δ are constants and σ is the standard sigmoidal function

σ(x) = 1/(1 + exp(−x))    (2.9)
In equation 2.8, the parameter G_i sets the scale for the particular maximal conductance g_i. C_T determines the concentration at which the asymptotic value of g_i is G_i/2, and Δ sets the slope of the sigmoid. The choice of the plus or minus in equation 2.8 determines whether g_i will fall or rise as a function of [Ca]. The slow regulatory processes we are modeling must not destabilize the activity of the neuron. To assure stability of the neuron, the choice of the plus or minus sign in equation 2.8 must be made correctly. Suppose that a specific set of maximal conductances has been established producing a certain level of electrical activity. If the neuron becomes more active
828
L. F. Abbott and Gwendal LeMasson
than this level, calcium entering through voltage-activated channels will raise the intracellular calcium concentration. Under these conditions, outward currents should increase in strength and inward currents decrease so that the activity of the neuron will be reduced back to the original level. Conversely, if the activity level drops, the calcium concentration will also fall. In this case, the inward currents should increase in strength and the outward currents should decrease. In other words, the feedback from activity to maximal conductances should be negative. To assure this, we use the plus sign in equation 2.8 for inward currents and the minus sign for outward currents. With this sign convention, increased calcium results in an increase of the outward and a decrease of the inward currents, while decreased calcium has the opposite effect. With the choices we have made, the evolution of the maximal conductances is given by

τ dg_i/dt = G_i σ(±(C_T − [Ca])/Δ) − g_i    (2.10)
where the variable sign is plus for inward currents and minus for outward currents. Because the intracellular calcium concentration depends on the maximal conductances, these are highly nonlinear equations. The parameter C_T in equation 2.10 plays the role of a target calcium concentration. If [Ca] is well below C_T, activity will increase due to the enhancement of inward and depression of outward currents. This will bring [Ca] up closer to the target value C_T. If [Ca] is well above C_T, there will be an opposite effect on the currents and [Ca] will drop toward C_T. Since the electrical activity of the neuron is highly correlated with the intracellular calcium concentration, stabilization of the intracellular calcium concentration results in a stabilization of the electrical activity of the neuron. As we will see, there is a direct connection between the target calcium concentration C_T and the activity level maintained by the model neuron. Even without the dynamic regulation we have added, conductance-based neuronal models tend to be quite complex. However, the model specified above can be analyzed in considerable detail because of the large difference between the rates of the slow regulatory processes described by equations 2.10 and the faster processes of equations 2.1-2.4 and 2.6. 3 A Two-Conductance Model
The simplest model we will use to study dynamic regulation of conductances is the Morris-Lecar model (Morris and Lecar 1981), which has one inward and one outward active current. The inward current is a calcium current given (using the parameters we have chosen) by
and the outward current is a potassium current,
I_K = g_K n (V − E_K)
(3.2)
with n given by a sigmoidal function of voltage,

n = σ((V − 10)/3)    (3.3)
In addition, there is a passive leakage current
I_L = 0.5 (V + 50)
(3.4)
and we will sometimes add an external current as well. In these equations, V is measured in millivolts and time in milliseconds. Under control conditions, we take E_Ca = 100 mV and E_K = −70 mV, although we will vary these parameters to simulate changes in the extracellular ion concentrations. We have added a persistent component (the 0.1 in equation 3.1) to the calcium current, which is not present in the original model (Morris and Lecar 1981). This is useful in the regulated model because calcium provides the feedback signal for the regulation process. Without a persistent component, loss of the calcium current would mean a loss of this signal. We take C = 1 µF/cm², G_Ca = 3 mS/cm², and G_K = 6 mS/cm². The behavior of this model neuron for the control values of the parameters is shown in Figure 1A. In the two-conductance model, the maximal conductances g_Ca and g_K are regulated by equations like 2.10, specifically

τ dg_Ca/dt = G_Ca σ((C_T − [Ca])/Δ) − g_Ca    (3.5)

and

τ dg_K/dt = G_K σ(−(C_T − [Ca])/Δ) − g_K    (3.6)

We wish to analyze the dynamics of these two maximal conductances. Dividing the first equation by G_Ca and the second by G_K, we find that the quantities g_Ca/G_Ca and g_K/G_K obey very similar equations. By adding the resulting two equations and using the identity σ(x) + σ(−x) = 1, we find that the quantity y defined by

y = g_Ca/G_Ca + g_K/G_K    (3.7)

obeys the trivial equation

τ dy/dt = 1 − y    (3.8)
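The relaxation of y in equation 3.8 holds no matter how the calcium concentration behaves, and is easy to check numerically. The sketch below Euler-integrates the regulation equations 3.5 and 3.6 against an arbitrary stand-in calcium trace; the parameter values and the calcium signal are illustrative assumptions, not the paper's simulation:

```python
import math

def sigma(x):
    """Standard sigmoid of equation 2.9."""
    return 1.0 / (1.0 + math.exp(-x))

# Assumed illustrative parameters (not the paper's exact values)
G_CA, G_K = 3.0, 6.0      # upper bounds on the maximal conductances (mS/cm^2)
CT, DELTA = 20.0, 5.0     # target calcium concentration and sigmoid slope
TAU = 1000.0              # slow regulation time constant (msec)

def regulate(g_ca, g_k, ca, dt=1.0):
    """One Euler step of equations 3.5 and 3.6."""
    g_ca += dt / TAU * (G_CA * sigma((CT - ca) / DELTA) - g_ca)
    g_k += dt / TAU * (G_K * sigma(-(CT - ca) / DELTA) - g_k)
    return g_ca, g_k

# Drive the regulation with an arbitrary calcium signal for many slow
# time constants; y = g_ca/G_CA + g_k/G_K relaxes to 1 regardless of [Ca](t).
g_ca, g_k = 0.0, 0.0
for t in range(200000):
    ca = 20.0 + 10.0 * math.sin(0.01 * t)   # stand-in for [Ca](t)
    g_ca, g_k = regulate(g_ca, g_k, ca)

y = g_ca / G_CA + g_k / G_K
print(round(y, 4))
```

Because the sigmoid identity cancels the calcium dependence from the sum, the final value of y is the same for any calcium trace; only z carries information about activity.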
Figure 1: (A) Membrane potential versus time at the quasi-steady-state point for the two-conductance model. (B) Maximal conductance "phase-plane" for the two-conductance model. Straight lines are nullclines of the slow, regulatory dynamics. The region marked Osc. is where oscillations of the regulatory system occur. The quasi-steady-state is where the two nullclines cross. Dashed paths marked 1-4 show routes to the steady-state point from four different starting conditions. For convenience (and without loss of generality) we have chosen the units of [Ca] so that the coefficient A in equation 2.6 is one. In these units C_T = 20 and Δ = 5. In addition, we take k = 1/(100 msec). These parameters are used for Figures 2-4 as well (except that C_T is varied in Fig. 3).

Likewise, taking the difference of these two equations and defining

z = g_Ca/G_Ca − g_K/G_K    (3.9)

we find that

τ dz/dt = tanh((C_T − [Ca])/(2Δ)) − z    (3.10)
Using equations 3.8 and 3.10, we can completely analyze the behavior of the model in the "phase-plane" of maximal conductances g_Ca and g_K. First, there is a nullcline y = 1 or, equivalently,

g_Ca/G_Ca + g_K/G_K = 1    (3.11)

from equation 3.8, and this is approached exponentially with time constant τ. The behavior of the z variable is more complex. Under some conditions, z will approach a quasi-equilibrium state. An equilibrium solution of equation 3.10 would occur when z = tanh[(C_T − [Ca])/(2Δ)]. However, if this value of z results in oscillatory behavior of the model neuron, the calcium concentration [Ca] will oscillate as well. Thus, this value of z will not truly be fixed. We can circumvent this complication because we are assuming that the time scale τ governing the motion of z is much greater than the time scale of the membrane potential oscillations. Although z will oscillate around the quasi-equilibrium value, if τ is large these oscillations will be very small. The quasi-equilibrium value of z is just the average value of the hyperbolic tangent,

z = ⟨tanh((C_T − [Ca])/(2Δ))⟩    (3.12)

where the brackets denote a time average over many membrane potential oscillation cycles. Equation 3.12 defines an approximate nullcline for the dynamics of the z variable for the maximal conductances. In Figure 1B, the solid lines indicate the nullclines 3.11 and 3.12 for the regulatory dynamics. The diagonal line with negative slope is the y nullcline g_Ca/G_Ca + g_K/G_K = 1, while the more horizontal line is the z nullcline. In the center of the figure, where the two nullclines cross, is the quasi-steady-state point of the full system, which results in the behavior seen in Figure 1A. This point is stable and its domain of attraction is the entire plane. There is a region of the plane (at the lower left of Fig. 1B) where z does not approach quasi-steady-state behavior at fixed y but instead goes into oscillations with a period of order τ. In this area there is, of course, no z nullcline. Instead, we have drawn the upper and lower bounds of the region over which the oscillations in z take place. Regions like this provide an interesting mechanism for generating rhythms with very long periods, such as circadian rhythms. These slow oscillations arise from the regulatory process interacting dynamically with the more conventional mechanisms producing the much faster membrane potential oscillations. The dynamically regulated model can spontaneously construct its conductances starting from any initial values of g_Ca and g_K. The dashed curves in Figure 1B show the approach to steady-state behavior from four different sets of initial conductances. There are no obstructions to the recovery of the quasi-steady-state values from any initial position in the plane.
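Equation 3.12 can be illustrated with a toy self-consistency calculation. The linear relation between z and the mean calcium level used below is a hypothetical stand-in (in the full model, [Ca](t) comes from integrating the membrane equations); the point is only that relaxing z slowly against the time-averaged hyperbolic tangent lands it on the approximate nullcline:

```python
import math

CT, DELTA = 20.0, 5.0

def ca_trace(z, n=1000):
    """Hypothetical calcium signal: a fast ripple riding on a mean level
    that rises with z (more inward conductance -> more calcium)."""
    mean_ca = 15.0 + 10.0 * z  # assumed activity-calcium relation
    return [mean_ca + 3.0 * math.sin(0.3 * i) for i in range(n)]

def avg_tanh(z):
    """Time average of tanh((CT - [Ca])/(2*DELTA)), the right side of 3.12."""
    vals = [math.tanh((CT - ca) / (2 * DELTA)) for ca in ca_trace(z)]
    return sum(vals) / len(vals)

# Relax z slowly, mimicking tau*dz/dt = <tanh> - z, until it sits on the
# approximate z nullcline of equation 3.12.
z = 0.0
for _ in range(300):
    z += 0.1 * (avg_tanh(z) - z)
print(abs(z - avg_tanh(z)) < 1e-4)
```

The fast ripple averages out because τ is assumed long compared to the membrane oscillation, which is exactly the separation of time scales the phase-plane analysis relies on.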
Figure 2: The range of steady-state oscillation frequencies that can be obtained using different values of the target calcium concentration C_T. Units of calcium concentration are as in Figure 1.

In the usual, unregulated, conductance-based models, the values of the maximal conductance parameters determine the behavior of the model neuron. In the regulated model, the maximal conductances are dynamic variables and, instead, the behavior of the model is governed by the parameters C_T and Δ that control the quasi-steady-state values of the maximal conductances. Of these, C_T is by far the more important parameter. By adjusting the value of this target calcium concentration, we can determine what sort of behavior the neuron will exhibit. In contrast to conventional models, once this value is chosen the desired behavior will be exhibited over a variety of external conditions. In Figure 2, we see that a wide range of oscillation frequencies can be obtained in the regulated, two-conductance model by choosing different values for the target calcium concentration C_T without changing any other parameters of the model.

The stabilizing effects of dynamic regulation are illustrated in Figure 3. When dynamic regulation is not included in the model, the firing frequency is extremely sensitive to the values of E_Ca and E_K, and firing occurs only over a limited range of these parameters. With dynamic regulation, stable firing at roughly the same frequency can be maintained over a wide range of E_Ca and E_K. Since these parameters are affected by the extracellular ionic concentrations, this reflects the ability of a dynamically regulated neuron to adjust to varying external conditions.

Figure 3: The dependence of oscillation frequency on the equilibrium potentials for (A) potassium and (B) calcium in the regulated and unregulated two-conductance models. Dynamic regulation stabilizes the frequency against changes in E_K and E_Ca. For the unregulated case, we fix the maximal conductances at the control values for the unregulated model.

The model maintains its firing frequency by shifting its maximal conductances in response to changes of these parameters. This is done through shifts in the value of z, which change the balance of inward and outward currents. Dynamically regulated neurons also exhibit activity-dependent shifts in their intrinsic characteristics. As we have seen, the regulatory mechanism tends to stabilize the activity of the neuron by shifting the values of the maximal conductances to maintain the level of activity that results in an average intracellular calcium concentration near the target value C_T. The introduction of external or synaptic inputs will likewise cause slow shifts in the values of the maximal conductances as the regulatory mechanism tries to maintain the same level of calcium and activity that existed in the absence of inputs. As a result, prolonged inputs cause changes in the intrinsic characteristics of the neuron.

Figure 4: The quasi-steady-state value of z as a function of the amplitude of an injected current. Both DC and pulsed injection cause shifts in the value of z that modify the balance between inward and outward currents and change the intrinsic properties of the model neuron. Pulses last for 250 msec and are repeated every 500 msec.

This is shown in Figure 4, where we investigate the effect of external current on a regulated model neuron. The external current causes a shift in the value of z, which changes the intrinsic electrical properties of the neuron by modifying the
balance between inward and outward currents according to equation 3.9. The quasi-steady-state value of z depends not only on the amplitude of the applied current but also on its time course. As shown in Figure 4, DC current injection has a different effect than pulses of current and we have found that the shift in z is also sensitive to the frequency and duty cycle of the pulses, in particular, the relation of the pulse frequency to the natural frequency of the model. These shifts occur over a slow time scale. Thus, the regulated model neuron will respond normally to brief pulses of current. However, prolonged current injection or synaptic input will change intrinsic properties. 4 General Analysis
The type of analysis we performed for the two-conductance model in the last section can be extended to models with arbitrarily large numbers of conductances. The key observation is that when equation 2.10 is divided by G_i, all of the ratios g_i/G_i satisfy the same equation except for the plus and minus sign difference for inward and outward currents. This implies that the difference g_i/G_i − g_j/G_j between any two outward or any two inward currents will go exponentially to zero with the time constant τ. Furthermore, the identity σ(x) + σ(−x) = 1 we used before implies that the sum g_i/G_i + g_j/G_j, where i is an outward current and j is an inward current, goes exponentially to one with the same time constant. As a result, we can write an explicit solution for all of the maximal conductances satisfying equation 2.10 expressed in terms of just one dynamic variable z,

g_i(t) = (G_i/2)(1 ± z) + c_i e^(−t/τ)    (4.1)

where the plus/minus sign is for inward/outward currents and the c_i are constants that determine the initial values of the maximal conductances g_i(0). The remaining dynamic variable z obeys the same equation as before,

τ dz/dt = tanh((C_T − [Ca])/(2Δ)) − z    (4.2)
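The collapse onto a single dynamic variable can be checked directly: integrating equation 2.10 for several currents driven by one shared, arbitrary calcium signal, the difference of g_i/G_i for two like-signed currents settles to zero and the sum for an inward/outward pair settles to one, whatever the calcium trace. A minimal sketch with assumed parameter values:

```python
import math
import random

def sigma(x):
    """Standard sigmoid of equation 2.9."""
    return 1.0 / (1.0 + math.exp(-x))

CT, DELTA, TAU = 20.0, 5.0, 500.0   # illustrative regulation parameters
G = [3.0, 1.5, 6.0, 2.0]            # assumed upper bounds: two inward, two outward
SIGN = [+1, +1, -1, -1]             # plus for inward, minus for outward currents

random.seed(0)
g = [random.uniform(0.0, Gi) for Gi in G]   # arbitrary initial conductances

# Euler-integrate equation 2.10 for all four currents with a shared,
# arbitrary [Ca](t); run for many regulation time constants.
for t in range(100000):
    ca = 20.0 + 8.0 * math.sin(0.02 * t)
    for i in range(4):
        g[i] += 1.0 / TAU * (G[i] * sigma(SIGN[i] * (CT - ca) / DELTA) - g[i])

r = [g[i] / G[i] for i in range(4)]
# Like-signed ratios converge to each other; an inward/outward pair sums to 1.
print(abs(r[0] - r[1]) < 1e-6, abs(r[0] + r[2] - 1.0) < 1e-6)
```

The transients are exactly the c_i e^(−t/τ) terms of equation 4.1; once they decay, one number z fixes all four conductances.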
Thus we have reduced the analysis of dynamic regulation in a model with any number of currents to the study of this single equation interacting with the rest of the model through the z dependence of [Ca]. As in the two-conductance case, there are two general types of behavior. First, the system can settle down to a quasi-steady-state as far as the slow dynamics is concerned. Again, although the membrane potential and calcium concentration may fluctuate (due to action potentials for example), there are no fluctuations over the time scale associated with dynamic regulation. These faster fluctuations have little effect on the slowly
varying maximal conductances. Alternatively, the slow system may never settle down, and oscillations or even chaotic behavior characterized by the slow time scale typical of regulatory processes may appear. Again, these can provide a model of circadian or other slow rhythms. 5 A Seven-Conductance Model
We have studied dynamic regulation in a more complex and realistic model, a variant of the model of Buchholtz et al. (1992) describing the LP neuron in the stomatogastric ganglion of the crab. This model has seven active conductances corresponding to Hodgkin-Huxley sodium and potassium currents, a slow and a fast A current, a calcium-dependent potassium current, a calcium current, and a mixed-ion current I_H. In addition, there is a passive leakage current. We allow all seven maximal conductances for the active currents to be modified by the calcium-dependent regulation scheme as described by equations 2.10. Depending on the value of the target calcium concentration C_T, the regulated LP model can exhibit silent, tonic firing, bursting, or locked-up (permanently depolarized) behavior. Although the model has seven dynamic maximal conductance variables, we can analyze the regulatory dynamics quite simply by using the z variable defined in the last section. After the exponential terms in equation 4.1 get small, the maximal conductances will take values

g_i = (G_i/2)(1 ± z)    (5.1)

with z determined by equation 4.2. To study the behavior of z in this model, we plot dz/dt given by the right side of equation 4.2 as a function of z in Figure 5. We also note the type of activity displayed by the model neuron for different values of z. For this figure, we have chosen the target calcium concentration C_T so that the neuron exhibits bursting behavior once the z parameter has relaxed to the point where dz/dt = 0. The quasi-steady-state is given by the zero crossing in the center of the figure, and it exhibits bursting behavior. In the bursting range, Figure 5 shows a double line because we have plotted both the maximum and minimum values of dz/dt. At a given z value (the quasi-steady-state value, for example) dz/dt will oscillate rapidly between the two lines shown due to the bursting behavior. These oscillations are not the same as those shown in Figure 1. The oscillations in Figure 1 are slow and are caused by the regulatory mechanism itself, while the oscillations here are just the result of the normal bursting activity of the neuron. In our previous work on this model (LeMasson et al. 1992) we observed an interesting phenomenon when two regulated neurons were electrically coupled. The techniques we have developed here allow us to explore this phenomenon more completely. The two-neuron circuit is shown in Figure 6. We start with two dynamically regulated model
neurons described by identical sets of equations with the same parameter values. The identical activity of the two model neurons when they are uncoupled is shown in Figure 6A. The two neurons are then coupled through an electrical synapse (synaptic current proportional to the voltage difference between the two neurons) that is likewise completely symmetrical. Figure 6B shows the steady-state activity of the coupled network. The two neurons burst in unison. To examine the intrinsic properties of the two neurons individually, we uncouple them once again and show in Figure 6C their activity immediately after they are decoupled. Despite the fact that the two model neurons are governed by identical sets of equations, the coupling between them has caused one neuron to display intrinsic bursting activity while the other fires tonically in isolation. The symmetric, two-cell network has spontaneously differentiated into a circuit involving a pacemaker and a follower neuron. If the two neurons are left uncoupled, the regulation process will eventually return them to their initial identical states, as seen in Figure 6D.

Figure 5: Plot of dz/dt versus z for the seven-conductance model. Both minimum and maximum values of dz/dt at a given z value have been plotted. Two lines appear in the bursting region due to fluctuations in the calcium level during bursting activity. Distinct behaviors obtained for different values of z are indicated by the inserts. Locked-up refers to a permanently depolarized state. The quasi-equilibrium value of z produces bursting behavior, as indicated by the zero crossing of dz/dt. Model parameters used for Figures 5 and 6 are as in LeMasson et al. (1992).
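The spontaneous differentiation just described can be caricatured in a toy model stripped of all membrane dynamics. Here each unit's average calcium is taken to be a hypothetical linear function of both z values, with a strong cross term standing in for tight electrical coupling; none of the coefficients are derived from the conductance model:

```python
import math

CT, DELTA = 20.0, 5.0
A_SELF, A_CROSS = 2.0, 15.0   # assumed: strong cross term mimics tight coupling

def dz(z_self, z_other):
    """Regulation dynamics of equation 4.2 with a toy activity-calcium map:
    the neighbor's activity drives this unit and raises its calcium."""
    ca = 20.0 + A_SELF * z_self + A_CROSS * z_other
    return math.tanh((CT - ca) / (2 * DELTA)) - z_self

z1, z2 = 0.001, 0.0   # nearly, but not exactly, symmetric start
for _ in range(8000):
    z1, z2 = z1 + 0.05 * dz(z1, z2), z2 + 0.05 * dz(z2, z1)

# The symmetric state z1 == z2 is a fixed point but unstable here: the tiny
# initial asymmetry grows until the two identical units settle at distinct
# z values (one calcium-rich/low-z, one calcium-poor/high-z).
print(abs(z1 - z2) > 0.5)
```

Linearizing around the symmetric point shows why: the cross term makes the antisymmetric mode unstable while the symmetric mode stays damped, the same spontaneous symmetry breaking seen at the three crossings of Figure 6E.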
To study this system, we monitor the maximal conductances of the two neurons and see how the coupling between them affects their behavior by performing the following numerical experiment. We hold z_1, the z value for one of the two neurons, fixed but allow z_2 to evolve according to equation 4.2 until it reaches its quasi-equilibrium value. This value will
depend on the fixed value z_1 we have chosen for the first neuron because the two neurons are coupled, and this coupling affects the behavior of z_2 through effects like those shown in Figure 4. We then record the time-averaged intracellular calcium concentrations of the two neurons, [Ca]_1 and [Ca]_2. By repeating this process for many different holding values z_1, we obtain the curves shown in Figure 6E. Actually, only one of these curves corresponds to the procedure just outlined, while the other is its reflection, obtained by interchanging the roles of neuron 1 and neuron 2. One curve thus shows the quasi-equilibrium calcium concentrations of neuron 2 when neuron 1 is held fixed, and the other the quasi-equilibrium concentrations of neuron 1 when neuron 2 is held fixed. The values of z_1 and z_2 determine the maximal conductances of the two neurons through the relation 5.1, and this in turn will control their intracellular calcium concentrations. Because z and [Ca] are related, we can use either the value of z or the value of the intracellular calcium concentration to characterize the balance of inward and outward maximal conductances. Up to now, we have used z because it is directly related to the maximal conductances through equation 5.1. However, to illustrate the two-neuron network we use the time-average of the calcium concentration in the two neurons rather than their z values, because the fluctuations caused by the bursting activity of the two neurons are smaller for the time-averaged calcium concentration, making the plot clearer. Otherwise, the two approaches are completely equivalent. The quasi-steady-state configurations of the fully regulated, interacting, two-neuron circuit are given by the points where the two curves in Figure 6E cross. The interesting feature of this particular network is that the lines cross in three places.
Figure 6: Facing page. (A) The behavior of two identical model neurons before they are coupled. (B) Electrical coupling between the neurons results in a bursting two-cell network. (C) Decoupling the two neurons reveals their intrinsic properties and indicates that one is acting as a pacemaker and the other as a tonically firing follower. (D) Long after the two neurons are decoupled, the regulation mechanism has returned them to their original identical states. (E) A plot of the time-averaged calcium concentration of one neuron when the other neuron's regulation dynamics is held fixed. The three crossing points are equilibrium points. The central, symmetric crossing is unstable, while the two outer crossings are stable quasi-steady-states with nonsymmetric properties.

The middle of these three crossings is the symmetric equilibrium point where the calcium concentrations, the z values, and the maximal conductances of the two neurons are identical. However, as is typical in cases of spontaneous symmetry breaking, this point is unstable for this particular network. The other two crossings are stable equilibrium points, and they have the novel feature that the intrinsic conductances of the two neurons are different. One neuron exhibits a higher calcium concentration than the other so, according to equation 4.2, its z value will be lower than that of the other neuron. As a
result, one of the neurons will have smaller inward and larger outward conductances than the other neuron, as given by equation 5.1. This is what causes the spontaneous differentiation of intrinsic properties seen in Figure 6C. The symmetry-breaking phenomenon that we have discussed requires electrical coupling between the two neurons that lies in a specific range. The coupling must be strong enough so that the two neurons have an impact on each other, but not so strong that their activity is forced to be identical. 6 Discussion
We have used a single second messenger, the intracellular calcium concentration, to act as the negative feedback element linking the maximal conductances of a model neuron to its electrical activity. If similar mechanisms exist in real neurons, they may be controlled by multiple second messengers. In addition, we have taken a particularly simple form of the regulatory equations by choosing a single sigmoidal curve (and its flipped version) for all of the conductances. What is surprising about these simplifications is that they nevertheless allow the full range of behaviors of the model neuron to be explored, as seen in Figure 5. The parameterization of equation 5.1 may thus be useful even in cases where dynamic regulation is not being studied. Any scheme based on a single second messenger will similarly probe a single line in the multidimensional space of maximal conductance values characterizing a particular model. The simple form of the functions F_i we used means this line is given by the simple equation 5.1; more general forms of the F_i would result in more complex curves. Nevertheless, it should be possible to find a variable like z, even with nonidentical forms for the F_i, that parameterizes path length along this general curve. As a result, we expect that the behavior of the model in the more general case will be qualitatively similar to the simple case we have analyzed. This argument also applies to models in which some of the maximal conductances are not regulated at all. We have thus far studied dynamic regulation as a global phenomenon in single-compartment models. A local form of dynamic regulation could have important consequences in a multicompartment model of a neuron. In such a model, the density of channels in various parts of the neuron would be correlated with the time-averaged calcium concentration in that region.
This provides a mechanism for controlling the distribution of conductances over the surface of a neuron (for a different approach to this problem, see Bell 1992) and for correlating the local channel density with structural and geometrical characteristics affecting calcium buffering and diffusion (preliminary work done in collaboration with M. Siegel). The dynamic regulation scheme was motivated by a need to build more robust neuronal models, and Figure 3 clearly shows that this goal
has been achieved. The fact that the dynamically regulated model also exhibits shifts in intrinsic characteristics due to interactions with other neurons is an interesting and unavoidable consequence of this robustness. If maximal conductances depend on activity, neurons in networks will be affected by each other and will adapt accordingly. Our two-neuron model resulted in an oscillating circuit with a pacemaker and a follower neuron. This differentiation was caused solely by the interaction of the two neurons. Either neuron could have developed into the pacemaker with the other becoming the follower. As in this simple example, it should be possible for identical dynamically regulated model neurons to self-assemble into more complex networks in which they play well-defined but different functional roles.
Acknowledgments We wish to thank Eve Marder for her collaboration during the development of these ideas and John Rinzel for helpful comments about the mathematical reduction of slow/fast systems. Research supported by National Institute of Mental Health Grant MH-46742 and National Science Foundation Grant DMS-9208206.
References

Bell, A. 1992. Self-organization in real neurons: Anti-Hebb in 'channel space'? In Neural Information Processing Systems 4, J. E. Moody and S. J. Hanson, eds., pp. 59-66. Morgan Kaufmann, San Mateo, CA.
Buchholtz, F., Golowasch, J., Epstein, I., and Marder, E. 1992. Mathematical model of an identified stomatogastric neuron. J. Neurophysiol. 67, 332-340.
Chad, J. E., and Eckert, R. 1986. An enzymatic mechanism for calcium current inactivation in dialysed Helix neurones. J. Physiol. (London) 378, 31-51.
Franklin, J. L., Fickbohm, D. J., and Willard, A. L. 1992. Long-term regulation of neuronal calcium currents by prolonged changes of membrane potential. J. Neurosci. 12, 1726-1735.
Hemmick, L. M., Perney, T. M., Flamm, R. E., Kaczmarek, L. K., and Birnberg, N. C. 1992. Expression of the h-ras oncogene induces potassium conductance and neuron-specific potassium channel mRNAs in the AtT20 cell line. J. Neurosci. 12, 2007-2014.
Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. 117, 500-544.
Kaczmarek, L. K. 1987. The role of protein kinase C in the regulation of ion channels and neurotransmitter release. TINS 10, 30-34.
Kaczmarek, L. K., and Levitan, I. B., eds. 1987. Neuromodulation: The Biochemical Control of Neuronal Excitability. Oxford Univ. Press, New York, NY.
Kennedy, M. B., ed. 1989. TINS 12, 417-479.
Koch, C., and Segev, I., eds. 1989. Methods in Neuronal Modeling. MIT Press, Cambridge, MA.
LeMasson, G., Marder, E., and Abbott, L. F. 1992. Activity-dependent regulation of conductances in model neurons. Science 259, 1915-1917.
Morgan, J. I., and Curran, T. 1991. Stimulus-transcription coupling in the nervous system: Involvement of the inducible proto-oncogenes fos and jun. Annu. Rev. Neurosci. 14, 421-451.
Morris, C., and Lecar, H. 1981. Voltage oscillations in the barnacle giant muscle fiber. Biophys. J. 35, 193-213.
Murphy, T. H., Worley, P. F., and Baraban, J. M. 1991. L-type voltage-sensitive calcium channels mediate synaptic activation of immediate early genes. Neuron 7, 625-635.
Rasmussen, H., and Barrett, P. Q. 1984. Calcium messenger system: An integrated view. Physiol. Rev. 64, 938-984.
Ross, W. M. 1989. Changes in intracellular calcium during neuron activity. Annu. Rev. Physiol. 51, 491-506.
Sheng, M., and Greenberg, M. E. 1990. The regulation and function of c-fos and other immediate early genes in the nervous system. Neuron 4, 477-485.
Smeyne, R. J., Schilling, K., Robertson, L., Luk, D., Oberdick, J., Curran, T., and Morgan, J. 1992. Fos-lacZ transgenic mice: Mapping sites of gene induction in the central nervous system. Neuron 8, 13-23.

Received 10 December 1992; accepted 26 February 1993.
ARTICLE
Communicated by Idan Segev
Limitations of the Hodgkin-Huxley Formalism: Effects of Single Channel Kinetics on Transmembrane Voltage Dynamics Adam F. Strassberg Computation and Neural Systems Program, California Institute of Technology, Pasadena, CA 91125 USA
Louis J. DeFelice Department of Anatomy and Cell Biology, Emory University School of Medicine, Atlanta, GA 30322 USA
A standard membrane model, based on the continuous deterministic Hodgkin-Huxley equations, is compared to an alternative membrane model, based on discrete stochastic ion channel populations represented through Markov processes. Simulations explore the relationship between these two levels of description: the behavior predicted by the macroscopic membrane currents versus the behavior predicted by their microscopic ion channels. Discussion considers the extent to which the random events underlying neural signals mediate random events in neural computation.
1 Introduction

Action potentials within the neuron arise from the time-variant and voltage-dependent changes in the conductance of the neural membrane to specific ions. Hodgkin and Huxley based their famous model of active membrane on the assumption that the ion permeation processes within the membrane can be approximated as both continuous and deterministic (Hodgkin and Huxley 1952). However, the permeation processes within active membrane are known to be neither continuous nor deterministic. Active membrane is studded with discrete ion channels undergoing random fluctuations between open and closed stable states (Hille 1992). There have been few studies of the relationship between these two levels of description, the discrete stochastic behavior of the microscopic ion channels versus the continuous deterministic behavior of their macroscopic membrane currents (Clay and DeFelice 1983). This paper investigates these two regimes of activity through a comparison of the standard membrane model, based on the continuous Hodgkin-Huxley equations, to an alternative membrane model, based on discrete ion channel populations represented through Markov processes. When both models are used to simulate the active membrane of the squid Loligo giant axon, the convergence of the alternative model to the standard model can be examined. Under certain conditions, the behavior predicted by the alternative model will diverge from the behavior predicted by the standard model. Under these conditions, simulations suggest that random microscopic behavior, such as single channel fluctuations, becomes capable of generating random macroscopic behavior, such as entire action potentials.

Neural Computation 5, 843-855 (1993) © 1993 Massachusetts Institute of Technology

2 Methods
The neural membrane of a space-clamped squid giant axon is modeled with an equivalent electric circuit. The space-clamp technique removes the spatial dependence of the membrane voltage, so the axon becomes effectively equivalent to an isopotential patch of membrane. A simple lumped circuit model thus can represent the electrical characteristics of the membrane. The macroscopic membrane conductances are represented by the conductive elements gNa, gK, and gL, and the transmembrane voltage Vm behaves according to the equation:

Cm dVm/dt = gNa(ENa - Vm) + gK(EK - Vm) + gL(EL - Vm) + Iinject    (2.1)
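The lumped-circuit equation can be integrated directly; the following forward-Euler sketch is ours, not the paper's GENESIS code, and freezes gNa and gK at assumed near-rest values purely for illustration (the models in the text update both conductances at every step):

```python
# Forward-Euler integration of the lumped-circuit membrane equation
#   Cm dVm/dt = gNa(ENa - Vm) + gK(EK - Vm) + gL(EL - Vm) + I_inject.
# The sodium and potassium conductance values below are assumed
# near-rest numbers, held fixed only for this illustration.

Cm = 1.0                    # uF/cm^2
gNa, ENa = 0.011, 115.0     # mS/cm^2, mV (assumed resting value)
gK,  EK  = 0.37, -12.0      # mS/cm^2, mV (assumed resting value)
gL,  EL  = 0.3,  10.613     # mS/cm^2, mV
dt = 0.01                   # msec

def euler_step(Vm, I_inject=0.0):
    dVdt = (gNa * (ENa - Vm) + gK * (EK - Vm)
            + gL * (EL - Vm) + I_inject) / Cm
    return Vm + dt * dVdt

Vm = 0.0
for _ in range(200000):     # 2 sec of membrane time
    Vm = euler_step(Vm)

# With fixed conductances, Vm relaxes to the conductance-weighted
# combination of the reversal potentials:
V_rest = (gNa * ENa + gK * EK + gL * EL) / (gNa + gK + gL)
```

With the chosen values the fixed point sits near 0 mV, consistent with the shifted Hodgkin-Huxley voltage convention in which rest is the origin.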
Behavior predicted by the standard Hodgkin-Huxley equations for the time-variant and voltage-dependent membrane conductances gNa and gK is compared to behavior predicted by alternative descriptions for these conductances based on their underlying ion channel populations. Ion channel activity is modeled well by Markov processes. Each channel is assumed to randomly fluctuate between only a finite number of discrete stable states. Transition probabilities between these stable states are assumed to depend on the present stable state and the present membrane voltage and to be independent of the duration for which this present stable state has been occupied. Such Markov assumptions are used to interpret the data from patch-clamp experiments on single ion channels. These data often are too limited to allow for the isolation of a single Markov kinetic scheme from the several alternative schemes (Strassberg and DeFelice 1992; Kienker 1989; Clay and DeFelice 1983; Conti et al. 1975; Hille 1992). For this simulation of the ion channel populations underlying the membrane conductances gNa and gK, the simplest noncooperative and serial schemes have been chosen from the set of schemes capable of generating the desired macroscopic behavior. Llano et al. (1988) have patch-clamped voltage-gated potassium channels in the active membrane of the squid giant axon. These channels show a single open state with a potassium ion conductance of 20 pS
(Llano et al. 1988). The following Markov kinetic scheme will reproduce the observed microscopic potassium channel behavior:
[n0] ⇌ [n1] ⇌ [n2] ⇌ [n3] ⇌ [n4]    (2.2)

with forward rate constants 4αn, 3αn, 2αn, αn and backward rate constants βn, 2βn, 3βn, 4βn for the four successive transitions,
where [ni] refers to the number of channels within the population currently in stable state ni, n4 labels the single open state, and αn and βn are the voltage-dependent rate constants from the Hodgkin-Huxley formalism (Armstrong 1969; Llano et al. 1988; Fitzhugh 1965; Hille 1992). Vandenberg and Bezanilla (1988) have patch-clamped voltage-gated sodium channels in the active membrane of the squid giant axon. These channels show a single open state with a sodium ion conductance of 20 pS (Bezanilla 1987; Vandenberg and Bezanilla 1988). The following Markov kinetic scheme will reproduce the observed microscopic sodium channel behavior:
[m0h1] ⇌ [m1h1] ⇌ [m2h1] ⇌ [m3h1]
  ⇅        ⇅        ⇅        ⇅
[m0h0] ⇌ [m1h0] ⇌ [m2h0] ⇌ [m3h0]    (2.3)

with forward rate constants 3αm, 2αm, αm and backward rate constants βm, 2βm, 3βm along each row, and rate constants αh (toward h1) and βh (toward h0) for the vertical transitions,
where [mihj] refers to the number of channels within the population currently in stable state mihj, m3h1 labels the single open state, and αm, βm, αh, and βh are the voltage-dependent rate constants from the Hodgkin-Huxley formalism (Bezanilla 1987; Vandenberg and Bezanilla 1988; Hille 1992; Fitzhugh 1965). Simulation parameters are chosen to be identical to those values for squid axonal membrane used by Hodgkin and Huxley in their seminal paper (Hodgkin and Huxley 1952):
Cm     1 µF/cm²           Membrane capacitance
T      6.3 °C             Temperature
EL     10.613 mV          Leakage Nernst potential
gL     0.3 mS/cm²         Leakage conductance
EK     -12.0 mV           Potassium Nernst potential
gK     36 mS/cm²          Maximal potassium conductance
DK     18 channels/µm²    Potassium ion channel density
γK     20 pS              Potassium channel open state conductance
ENa    115.0 mV           Sodium Nernst potential
gNa    120 mS/cm²         Maximal sodium conductance
DNa    60 channels/µm²    Sodium ion channel density
γNa    20 pS              Sodium channel open state conductance
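A channel population obeying scheme 2.2 can be simulated by drawing each channel's transition independently at every time step. The sketch below is our illustration (not the paper's GENESIS implementation), using the standard Hodgkin-Huxley rate functions for αn and βn with the voltage measured in mV from rest:

```python
import math
import random

# Illustrative stochastic simulation of scheme 2.2 for a population of
# N potassium channels under voltage clamp.  Rate functions follow
# Hodgkin and Huxley (1952); V is in mV from rest.

def alpha_n(V):
    if abs(V - 10.0) < 1e-9:               # remove the 0/0 singularity
        return 0.1
    return 0.01 * (10.0 - V) / (math.exp((10.0 - V) / 10.0) - 1.0)

def beta_n(V):
    return 0.125 * math.exp(-V / 80.0)

def step_population(state, V, dt):
    """One time step of scheme 2.2.  state[i] = number of channels in n_i.
    A channel in n_i gains an open gate with probability (4-i)*alpha_n*dt
    and loses one with probability i*beta_n*dt (first order in dt)."""
    a, b = alpha_n(V), beta_n(V)
    up, down = [0] * 5, [0] * 5
    for i in range(5):
        p_up, p_down = (4 - i) * a * dt, i * b * dt
        for _ in range(state[i]):
            r = random.random()
            if r < p_up:
                up[i] += 1
            elif r < p_up + p_down:
                down[i] += 1
    new = list(state)
    for i in range(4):
        new[i] -= up[i]
        new[i + 1] += up[i]
    for i in range(1, 5):
        new[i] -= down[i]
        new[i - 1] += down[i]
    return new

random.seed(0)
N, dt, V = 200, 0.01, 10.0                 # channels, msec, mV (assumed)
state = [N, 0, 0, 0, 0]                    # all channels fully closed
for _ in range(3000):                      # 30 msec burn-in
    state = step_population(state, V, dt)
acc = 0.0
for _ in range(5000):                      # time-average over 50 msec
    state = step_population(state, V, dt)
    acc += state[4] / N
open_fraction = acc / 5000
```

At a fixed clamped voltage the long-run open fraction [n4]/N should fluctuate around n∞⁴ (with n∞ = αn/(αn+βn)), the same fourth-power factor that appears in the deterministic description below.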
For a membrane model using discrete stochastic channel populations with the given Markov kinetics, 2.2 and 2.3, the potassium and sodium membrane conductances will satisfy

gK(V, t) = γK [n4]
gNa(V, t) = γNa [m3h1]
For a membrane model using continuous deterministic Hodgkin-Huxley equations, the potassium and sodium membrane conductances will satisfy

gK(V, t) = ḡK n⁴
gNa(V, t) = ḡNa m³h
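The deterministic counterpart integrates the gating variable n and reads off gK = ḡK n⁴; a brief sketch (ours, not the paper's code) of the potassium step response probed in Figure 1, under a voltage-clamp step from rest to +50 mV:

```python
import math

# Deterministic Hodgkin-Huxley potassium conductance under voltage
# clamp: dn/dt = alpha_n(V)(1 - n) - beta_n(V) n, and gK = gK_max * n^4.
# Rate functions follow Hodgkin and Huxley (1952); V in mV from rest.

def alpha_n(V):
    if abs(V - 10.0) < 1e-9:               # remove the 0/0 singularity
        return 0.1
    return 0.01 * (10.0 - V) / (math.exp((10.0 - V) / 10.0) - 1.0)

def beta_n(V):
    return 0.125 * math.exp(-V / 80.0)

gK_max = 36.0                       # mS/cm^2, from the parameter table
dt = 0.01                           # msec
V_hold, V_step = 0.0, 50.0          # clamp step, as in Figure 1

a, b = alpha_n(V_hold), beta_n(V_hold)
n = a / (a + b)                     # steady state at the holding potential
trace = []
for _ in range(5000):               # 50 msec after the step
    a, b = alpha_n(V_step), beta_n(V_step)
    n += dt * (a * (1.0 - n) - b * n)
    trace.append(gK_max * n ** 4)
# trace rises sigmoidally toward gK_max * n_inf(V_step)**4
```

The smooth n⁴ waveform produced here is the limit to which the stochastic [n4] estimate converges as the channel population grows, which is the convergence illustrated in Figure 1.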
All simulations of membrane behavior are performed using GENESIS,¹ an object-oriented general purpose neural simulator for the UNIX/X-windows environment. Two new GENESIS objects are designed to model squid potassium and sodium ion channel populations undergoing Markov kinetics as given by 2.2 and 2.3, respectively.² GENESIS scripts are produced for an isopotential patch of squid giant axon membrane under both the voltage-clamp and free-running experimental paradigms, with membrane conductances represented through either Hodgkin-Huxley equations or channel populations with Markov kinetics.

3 Results
Figure 1 shows the voltage-clamp step response of the membrane conductances gK and gNa. Both the continuous Hodgkin-Huxley equations and the discrete channel population Markov models are used alternatively to represent the membrane conductances. Note that as the size of each channel population is increased, the response from the discrete channel model converges to the behavior predicted by the continuous Hodgkin-Huxley currents. Figure 2 shows the response of a free-running membrane patch to a constant current injection and Figure 3 shows the resting response with no current injection. Figure 4 compares the mean firing frequencies of these responses to the membrane surface area of their underlying patches. For fixed channel densities, as the membrane surface area is increased, the response from the simulation of a constant density of ion channels converges to the response from the standard

¹GENESIS © 1989, designed by Matt Wilson and Jim Bower at California Institute of Technology. Inquiries may be directed to [email protected] or [email protected].
²Several new objects and scripts have been incorporated into the source code of GENESIS v1.4 as of July 1992. Inquiries may be directed to [email protected].
Figure 1: Voltage-clamp step response of membrane conductance. Membrane voltage is clamped to Vrest and stepped to Vrest + 50.0 mV at t = 5.0 msec. The response of each active membrane conductance is simulated with varying populations of discrete channels undergoing appropriate voltage-dependent Markov kinetics. These responses are compared to the behaviors predicted by the continuous Hodgkin-Huxley equations. Note that all outputs are normalized and displaced appropriately for display. As the size of each channel population is increased, the response from the discrete channel model converges to the behavior predicted by the continuous Hodgkin-Huxley equations, for both (top) the potassium conductance gK(t) and (bottom) the sodium conductance gNa(t).
Figure 2: Membrane response with injection current. The membrane model is simulated with standard biophysical parameters for squid axonal membrane (Cm, ENa, EK, EL, gL) and with constant current injection (Iinject = 100 pA/µm²). The continuous Hodgkin-Huxley equations and the discrete channel populations are used alternatively to represent the membrane conductances gNa and gK. As the membrane surface area is increased, the response from the channel model converges to the response from the standard Hodgkin-Huxley model. Both models predict that a regular train of action potentials will occur when this constant current is injected. Note that, as the membrane surface area is decreased, the regularity of the spike train generated by the channel model diverges from the behavior predicted by the Hodgkin-Huxley model.
Hodgkin-Huxley model. Both models predict that, for large membrane surface areas, a train of action potentials will occur when constant current is injected and that no activity will occur when no current is injected. However, note that, as the membrane surface area is decreased, the behavior predicted by the channel model diverges dramatically from the behavior predicted by the Hodgkin-Huxley model. These simulations suggest that, for an isopotential membrane patch with constant densities of sodium and potassium channels, as the membrane area is decreased, the fluctuations of single channels will become capable of eliciting entire action potentials.

Figure 3: Membrane response without injection current. The membrane model is simulated with standard biophysical parameters for squid axonal membrane (Cm, ENa, EK, EL, gL) and with no current injection (Iinject = 0 pA/µm²). The continuous Hodgkin-Huxley equations and the discrete channel populations are used alternatively to represent the membrane conductances gNa and gK. As the membrane surface area is increased, the response from the channel model converges to the response from the standard Hodgkin-Huxley model. Both models predict that no activity occurs when no current is injected. However, as the membrane surface area is decreased, the active behavior predicted by the channel model diverges dramatically from the lack of activity predicted by the Hodgkin-Huxley model.

Figure 4: Mean firing frequency versus membrane area. For a given membrane area and a given constant current injection, the number of attendant action potentials is averaged over a 1 sec duration to derive a mean firing frequency. As membrane area increases, the firing frequencies from the channel model converge to the firing frequencies from the Hodgkin-Huxley model. However, as membrane area decreases, these responses diverge dramatically. These simulations suggest that, as the area of an isopotential membrane patch is decreased, the voltage noise from single channel fluctuations will become capable of eliciting entire action potentials. (Over the smaller membrane surface areas, the graph shows the mean firing rates first to increase and then to decrease. For such small regions, the opening of a single channel will depolarize the membrane to ENa and so the definition of "action potential" becomes somewhat obfuscated.)
4 Discussion
The standard membrane model, based on the Hodgkin-Huxley equations, has been compared to an alternative membrane model, based on ion channel populations represented through Markov processes. When both models are used to simulate the active membrane of the squid Loligo giant axon, the explicit convergence of the alternative model to the standard model can be observed. However, under certain conditions, the behavior predicted by the alternative model diverges dramatically from the behavior predicted by the standard model.

4.1 Membrane Voltage Perturbations Due to Single Ion Channel Fluctuations. The divergent behavior can be explained through an analysis of the voltage perturbations across the membrane due to single ion channel fluctuations. Whenever a single ion channel moves from a closed state into an open state, the transmembrane voltage Vm behaves according to the first-order transient:
Vm(t) = ΔVm (1 − e^(−t/τ)) + Vrest
The magnitude ΔVm of the resultant voltage perturbation is mediated by a voltage divider between the conductance of the opened ion channel and the conductance of the surrounding membrane, which includes both the leakage conductance and the summed conductances of all other currently opened ion channels. The rise-time τ of this resultant voltage perturbation is equal to the membrane capacitance divided by the total membrane conductance. Note that there will be a correction term to the usual area-independent τ because the total membrane conductance is now the sum of both the conductance of the membrane surrounding the opened channel, which does scale with area, and the conductance of the individual opened channel, which does not scale with area. For a given ion channel, the magnitude of the open state conductance and the voltage dependence of the stable state kinetics are independent of the surface area of the surrounding membrane. However, when this ion channel enters the open state, both the magnitude ΔVm and the rise-time τ of the resultant voltage perturbation across the membrane are dependent on the surface area of the surrounding membrane. For the specific biophysical parameters of squid axonal membrane, the voltage perturbation due to the random opening of a single sodium channel simplifies to an expression in which the total membrane surface area A appears explicitly (Strassberg and DeFelice 1992). As this surface area A is decreased, the magnitude ΔVm increases
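A back-of-the-envelope version of this voltage-divider argument makes the area dependence concrete. This is our reading of the text, not the paper's published expression, and the resting specific conductance is taken to be just the leakage term:

```python
# Illustrative voltage-divider estimate (our reading of the argument in
# the text, not the paper's exact formula).  When a single channel of
# conductance gamma opens against a resting membrane of specific
# conductance g_rest and area A:
#   dV  = (E - V_rest) * gamma / (gamma + g_rest * A)
#   tau = Cm * A / (gamma + g_rest * A)

def perturbation(A, gamma=20e-12, g_rest=0.3e-3, Cm=1e-6,
                 E=115.0, V_rest=0.0):
    """A in cm^2, gamma in S (a 20 pS sodium channel), g_rest in S/cm^2
    (leakage only, an assumption), Cm in F/cm^2, E and V_rest in mV.
    Returns (dV in mV, tau in seconds)."""
    g_total = gamma + g_rest * A
    dV = (E - V_rest) * gamma / g_total
    tau = Cm * A / g_total
    return dV, tau

dV_small, tau_small = perturbation(1e-8)   # a 1 um^2 patch
dV_large, tau_large = perturbation(1e-4)   # a 10^4 um^2 patch
# The smaller the patch, the larger the voltage perturbation a single
# opened sodium channel produces, and the faster it develops.
```

For the 1 µm² patch the leakage conductance (3 pS) is smaller than the channel's own 20 pS, so a single opening drives the patch most of the way to ENa, whereas for the large patch the same opening produces only a fraction of a millivolt.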
of averaged data trials, which show much less variability. However, the fluctuations of single channels are probabilistic events with a high degree of variability from one trial to the next. With the current absence of any strong consensus on how the nervous system encodes information (beyond the level of sensory transduction), one is unable to distinguish strongly the "signal" from the "noise." The filtering and averaging of the data to remove the "noise" thus may be removing important components of the data (Bower and Koch 1992). Although the full prevalence and abundance of spontaneous action potentials are presently unknown, many potential roles for such spontaneous activations do exist in neural computation. While the effect of noise in a sensory system may be generally detrimental, the effect of noise in a planning, coordination, or motor system would not necessarily be as severe. For example, spontaneous action potentials could stop the repetition of unrewarded preprogrammed behaviors or perhaps even allow for the generation of entirely new responses to novel stimuli. During neurodevelopment, random activity could play a role in the coordination, correlation, and robust tuning of receptive field structures. From neuroethology, we know that organisms generate a host of spontaneous behavior patterns on the macroscopic level; thus it is reasonable to hypothesize that such spontaneous macroscopic behaviors arise from spontaneous microscopic behaviors. This paper has used simulation and analysis to show that theoretical mechanisms exist for both the attenuation and the amplification of single channel noise. Experimental convention typically has ignored the underlying stochastic nature of the neuron in favor of the averaged neural response properties.
However, as more physiological data on spontaneous activations do become available, the degree to which the random microscopic events underlying neural signals mediate random macroscopic events in neural computation will become more apparent.

Acknowledgments

This material is based on work supported under a National Science Foundation Graduate Fellowship and NIH HL-27385. We would like to express our deep appreciation to Dr. Christof Koch for his comments and suggestions throughout the preparation of this manuscript. We also would like to thank Hsiaolan Hsu and Dr. Henry Lester for helpful insights.

References

Armstrong, C. M. 1969. Inactivation of the potassium conductance and related phenomena caused by quaternary ammonium ion injected in squid axons. J. Gen. Physiol. 54, 553-575.
Bezanilla, F. 1987. Single sodium channels from the squid giant axon. Biophys. J. 52, 1087-1090.
Bower, J., and Koch, C. 1992. Experimentalists and modelers: Can we all just get along? Trends Neurosci. 15, 458-461.
Clay, J., and DeFelice, L. 1983. Relationship between membrane excitability and single channel open-close kinetics. Biophys. J. 42, 151-157.
Conti, F., DeFelice, L. J., and Wanke, E. 1975. Potassium and sodium ion current noise in the membrane of the squid giant axon. J. Physiol. (London) 248, 45-82.
Fitzhugh, R. 1965. A kinetic model of the conductance changes in nerve membrane. J. Cell. Comp. Physiol. 66, Suppl., 111-117.
Franciolini, F. 1987. Spontaneous firing and myelination of very small axons. J. Theor. Biol. 128, 127-134.
Hille, B. 1992. Ionic Channels of Excitable Membranes, 2nd ed. Sinauer Associates, Sunderland, MA.
Hodgkin, A. L., and Huxley, A. F. 1952. A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London) 117, 500-544.
Kienker, P. 1989. Equivalence of aggregated Markov models of ion-channel gating. Proc. R. Soc. London B 236, 269-309.
Koch, C., Zador, A., and Brown, T. H. 1992. Dendritic spines: Convergence of theory and experiment. Science 256, 973-974.
Llano, I., Webb, C. K., and Bezanilla, F. 1988. Potassium conductance of the squid giant axon. J. Gen. Physiol. 92, 179-196.
Segev, I., and Rall, W. 1988. Computational study of an excitable dendritic spine. J. Neurophys. 60, 499-523.
Strassberg, A. F., and DeFelice, L. J. 1992. Limitations of the Hodgkin-Huxley formalism. Computation and Neural Systems Program Memo 24, California Institute of Technology.
Vandenberg, C. A., and Bezanilla, F. 1988. Single-channel, macroscopic and gating currents from Na channels in squid giant axon. Biophys. J. 53, 226a.
Received 16 March 1992; accepted 26 February 1993.
Communicated by Christof Koch
Two-Dimensional Motion Perception in Flies A. Borst
M. Egelhaaf Max-Planck-Institut für biologische Kybernetik, Spemannstrasse 38, 7400 Tübingen, Germany

H. S. Seung* Racah Institute of Physics and Center for Neural Computation, Hebrew University, Jerusalem 91904, Israel
We study two-dimensional motion perception in flies using a semicircular visual stimulus. Measurements of both the H1-neuron and the optomotor response are consistent with a simple model supposing spatial integration of the outputs of correlation-type motion detectors. In both experiment and model, there is substantial H1 and horizontal (yaw) optomotor response to purely vertical motion of the stimulus. We conclude that the fly's optomotor response to a two-dimensional pattern, depending on its structure, may deviate considerably from the direction of pattern motion. 1 Introduction
The projection of the velocity vectors of objects moving in three-dimensional space on the image plane of an eye or a camera can be described as a vector field. This two-dimensional velocity field is time-dependent and assigns the direction and magnitude of a velocity vector to each point in the image plane. However, the velocity field is a purely geometric concept and does not directly represent the input of a visual system. Instead, the only information available to a visual system is given by the time-dependent brightness values as sensed by photoreceptors in the image plane. The problem of motion perception has often been posed as that of recovering the velocity field from these brightness values. For the case of simple translation of a Lambertian surface under uniform illumination, this computation can be done (Verri and Poggio 1989). Such a physical motion leads to the translation of a brightness pattern across the image plane. Several known local algorithms (Reichardt et al. 1988; Srinivasan 1990; Uras et al. 1988) recover the correct velocity field, which is constant in space and time. Algorithms utilizing a smoothness

*Present address: AT&T Bell Laboratories, Murray Hill, NJ 07974 USA.

Neural Computation 5, 856-868 (1993) © 1993 Massachusetts Institute of Technology
regularizer (Horn and Schunck 1981; Hildreth and Koch 1987) also perform well in extracting the true velocity. All these algorithms fail to yield a unique answer only for the special case of one-dimensional patterns. This is because a moving one-dimensional pattern is consistent with an infinite number of velocity fields (Ullman 1983; Horn and Schunck 1981; Hildreth and Koch 1987; Reichardt et al. 1988). In contrast, the superposition of two differently oriented one-dimensional sine gratings, a plaid pattern, has a uniquely determined velocity vector. The direction of motion of such a two-dimensional (2d) pattern, which is different from the orientations of its one-dimensional (1d) component gratings, is perceived by human observers under certain conditions (Adelson and Movshon 1982; Ferrera and Wilson 1990; Stone et al. 1990; Stoner et al. 1990). On the basis of physiological experiments with plaid stimuli, Movshon and co-workers (1986) have argued that motion processing in the primate visual system takes place in two stages. The first stage is composed of local movement detectors in areas V1 and MT sensitive to the orientation of the components. The second stage of processing is composed of neurons in MT that respond to the direction of pattern motion, presumably computing it from the output of the first stage. In this work, we examine whether the fly visual system computes the direction of motion of a 2d pattern. In the past, the fly has proven to be an excellent model system for analyzing motion detection. Most notably, much is known about the structure and physiology of its visual ganglia and the motion-dependent behaviors controlled by them (for review, see Egelhaaf et al. 1988; Hausen and Egelhaaf 1989; Egelhaaf and Borst 1993). One such motion-dependent behavior is the optomotor response, in which the fly tends to move so as to stabilize a moving visual surround (Fermi and Reichardt 1963; Götz 1972).
A simple model of the fly's optomotor pathway has been quite successful in accounting for both neurophysiological and behavioral data. According to this integrated correlation model, there are local movement detectors of the correlation type (Fig. 1A) (Hassenstein and Reichardt 1956; Reichardt 1961; Borst and Egelhaaf 1989) that are organized in two-dimensional retinotopic arrays covering the entire visual field. A set of identified, directionally selective, large-field interneurons in the third visual ganglion spatially integrates over the responses of horizontally oriented detectors in this array (Hausen and Egelhaaf 1989). The yaw optomotor response is a low-pass filtered version of the output of this horizontal system (Egelhaaf 1987). There is also a vertical system in the third visual ganglion, believed to mediate the pitch optomotor response according to an analogous model (Hengstenberg 1982). The visual pattern used in our experiments was a dark circular disk moving on a bright background. The predictions of the integrated correlation model were compared with the responses of the H1-neuron (a cell
Figure 1: Outline of the motion detector model used to derive the predictions shown in Figures 2 and 3. (A) Single correlation-type motion detector consisting of two mirror-symmetrical subunits with opposite preferred directions. Each subunit has two input lines. The signal in one line is temporally filtered and then multiplied with the unfiltered signal of the other line. The outputs of the subunits are subtracted from each other. (B) Responses of a two-dimensional array of orthogonally oriented pairs of motion detectors to a disk moving in the y direction. Shown is the vector field of equation 2.3, which was calculated using the continuum approximation of Reichardt (1987). The components of each vector are the responses of the x- and y-detectors at that point. The response is only nonzero on the boundary of the disk.
integrating horizontally oriented motion detectors) and the optomotor response about the vertical axis (yaw torque).

2 Responses of Correlation-Type Motion Detectors
Our model consists of a two-dimensional square lattice of correlation-type motion detectors.¹ At each point of the lattice is a pair of detectors oriented along the x and y axes. The luminance of the stimulus is denoted by I(r, t), where r is a vector in the xy plane and t is time. We treat the responses of the detector pair at r as the components of a vector,
d(r, t) = ( I(r, t) I(r + e_x, t + Δt) − I(r + e_x, t) I(r, t + Δt),
           I(r, t) I(r + e_y, t + Δt) − I(r + e_y, t) I(r, t + Δt) )      (2.1)
Here e_x and e_y are vectors in the x and y directions, corresponding to the spacing between adjacent lattice points. The two terms of each component of d correspond to the opponent subunits shown in Figure 1A, each a spatiotemporal cross-correlation of luminances. In equation 2.1 the temporal filtering is written as a simple time delay Δt, but in our computer simulations it was more realistically modeled as a low-pass filter. The response of the H1-neuron is modeled as the x component of the integrated response vector,

D(t) = ∫ dr d(r, t)
(2.2)
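As a hedged illustration of equations 2.1 and 2.2 (our sketch in Python, not the original MYST simulation; the lattice size, disk radius, and speeds are arbitrary choices), the x-oriented detector array can be applied to a dark disk moving on a bright background:

```python
import numpy as np

def summed_x_response(I, dt=1):
    """Per-frame, spatially integrated output of the x-oriented detectors.

    I is a luminance movie of shape (T, X, Y). Each detector multiplies the
    direct signal at r with the delayed signal at r + e_x and subtracts the
    mirror-symmetric product (the x component of eq. 2.1); summing over the
    lattice gives the x component of D(t) (eq. 2.2). The temporal filter is
    modeled here as a pure one-frame delay.
    """
    A, Ad = I[:-dt], I[dt:]                                    # I(r, t), I(r, t + Δt)
    Bx, Bxd = np.roll(A, -1, axis=1), np.roll(Ad, -1, axis=1)  # same at r + e_x
    return (A * Bxd - Bx * Ad).sum(axis=(1, 2))

def disk_movie(velocity, T=10, N=64, r0=15):
    """Dark disk (luminance 0) on a bright background (1), moving at `velocity`."""
    x, y = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    vx, vy = velocity
    I = np.ones((T, N, N))
    for t in range(T):
        cx, cy = N // 2 + vx * (t - T // 2), N // 2 + vy * (t - T // 2)
        I[t][(x - cx) ** 2 + (y - cy) ** 2 < r0 ** 2] = 0.0
    return I

Dx_vertical = summed_x_response(disk_movie((0, 2)))    # disk moving in +y
Dx_horizontal = summed_x_response(disk_movie((2, 0)))  # disk moving in +x
```

For the centered full disk, the integrated x-response to vertical motion cancels exactly, frame by frame, by left/right mirror symmetry, while horizontal motion drives it strongly; this previews the cancellation argument made for the circular stimulus below.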
The yaw torque of the fly is modeled as a low-pass filtered version of the neural response. We chose a time constant of 3 sec, which is consistent with Egelhaaf (1987) and yields a good fit to the experimental data. Figure 1B shows that the local response d(r, t) to an upwardly moving circular disk is very different from its velocity field. The velocity field (not shown) is everywhere constant and points in the y direction. The local response, on the other hand, varies greatly: it is zero inside and outside the disk, and takes the form
d(θ) ∝ r̂ sin θ
(2.3)
on the boundary. Here θ denotes the angle from the x axis, and r̂ is the unit vector in the radial direction. This formula follows from equation A.4 of the appendix, where the local response is calculated analytically using a continuum approximation (Reichardt and Guo 1986; Reichardt 1987; Reichardt and Schlögl 1988). It is evident from Figure 1B that the x-detectors bias the direction away from the true velocity. At the upper right and lower left edges they signal positive (rightward) motion, whereas at the opposite sides they signal negative (leftward) motion. Thus the circle in vertical motion mimics horizontal motion at its obliquely oriented edges.

¹A more realistic triangular lattice model of the fly photoreceptor array (Buchner 1976; Buchner et al. 1978) yields similar predictions.
This is not surprising, since for this stimulus the brightness in both input channels of a horizontal motion detector is either increased or decreased during vertical motion in the same temporal sequence as during horizontal motion. The pattern dependence of the local response is manifest, and has been studied previously (Reichardt 1987; Borst and Egelhaaf 1989). However, the flight control system of the fly is thought to depend on the integrated response of such an array of motion detectors (Egelhaaf et al. 1989). Although there is a significant local x-response to vertical motion of a circle (Fig. 1B), the integrated x-response is exactly zero because the contributions from the left and right halves of the circle cancel each other out. Hence, for a full circular stimulus the direction of the integrated response of an array of correlation-type motion detectors is the same as that of a true velocity sensor. To create a stimulus without such cancellation effects, the circle was moved behind a square aperture in such a way that maximally only a semicircle was visible. Figures 2 and 3A exhibit the analytic results and numerical simulations for this semicircular stimulus. Two features of the model predictions are noteworthy: (1) In contrast to a true velocity sensor, the integrated x-output responds not only to horizontal motion but also to vertical motion, with a time course that depends on the stimulus pattern. (2) The response to horizontal motion retains the same sign throughout the duration of the stimulus, the sign depending on the direction of motion. In contrast, the response to vertical motion changes sign when half of the semicircle is visible in the aperture, thereby erroneously mimicking an inversion of the direction of motion.

3 Responses of the Fly
These predictions were first tested by recording the spike activity of the H1-neuron in female blowflies (Calliphora erythrocephala) following standard procedures (Hausen 1982; Borst and Egelhaaf 1990). The resulting spike-frequency histograms of the response to a moving semicircular stimulus are shown in Figure 3B. The preferred direction of the H1-neuron, when tested with one-dimensional grating patterns, is horizontal motion from back to front in the entire visual field of one eye (McCann and Dill 1969; Hausen 1976). Similarly, we found that the neuron was excited by horizontal motion of the semicircular stimulus in the preferred direction, and slightly inhibited by motion in the null direction. The magnitude of the null response is smaller than that of the preferred response, probably due to the low spontaneous activity of the cell and the resulting rectification nonlinearity. The response of the H1-neuron to vertical motion of a one-dimensional grating pattern is negligible (Hausen 1976). However, during vertical motion of the semicircular stimulus, the neuron's response shows pronounced modulations and even a sign reversal relative to the resting level (hori-
Figure 2: Spatially integrated response of correlation-type motion detectors to a moving circle seen through an aperture, calculated in equations A.8 and A.9 using the continuum approximation of Reichardt (1987). The x-response to motion in the y-direction (transverse response) takes the form M_xy(γ) = α[1 − (1 − 2γ)²], and the x-response to motion in the x-direction (longitudinal response) takes the form M_xx(γ) = α cos⁻¹(1 − 2γ). The prefactor α has a quadratic dependence on the contrast of the stimulus, and γ is the fraction of the semicircle that is visible in the aperture. These formulas are valid for the first half of the stimulus period, when γ is increasing from 0 to 1. The formulas for the second half of the period are similar. The stimulus trace indicates the visible part of the circle at various instants in time.

zontal line in Figure 3B). The responses to upward and downward motion are modulated in the same way but have opposite signs. Except for the fact that the neuron's responses to vertical motion are almost as strong as its responses to horizontal motion, all response features are in good agreement with the predictions of the correlation-detector model (compare Fig. 3B with Figs. 3A and 2). Hence, the response of the H1-neuron
Figure 3: (A) Spatially integrated responses of a square lattice of correlation-type motion detectors to a black circle moving in various directions on a bright background behind a square aperture. A 20 × 20 array of motion detectors of the correlation type (Fig. 1) was simulated on an IBM PS/2 using the MYST language (Keithley Instruments). The motion detectors had a sampling base of one lattice constant, a first-order low-pass filter as delay line, and were horizontally oriented with preferred direction from left to right. Note that the responses to horizontal and vertical motion are scaled differently. (B) Responses of an identified, directionally selective, motion-sensitive, large-field neuron (H1-cell) of the blowfly Calliphora erythrocephala to the same stimulus. Stimuli were generated on a CRT (Tektronix 608) controlled by an IBM AT through an image synthesizer (Picasso, Innisfree Inc.) with a 200 Hz frame rate. The luminance of the circle was 4 cd/m², and that of the background was 24 cd/m². The contrast (I_max − I_min)/(I_max + I_min) amounted to 71%. The square aperture had the same extent as the diameter of the circle. The stimulus was presented to only the left eye of the fly at a distance of 7 cm. The circle had a diameter of 70° as seen by the fly. The center of the aperture was at 35° horizontal position and 0° vertical position with respect to the fly. Shown are the mean spike frequency histograms (40 msec binwidth) ± the SEM of the recordings of the H1 responses of 10 flies. Each fly was tested between 50 and 100 times (71 times on average) for each stimulus condition. The cell had rightward motion as its preferred direction. The horizontal line marks the resting firing level. The stimulus trace indicates the visible part of the circle at various instants in time.
is not simply the horizontal component of pattern motion. Recording from neurons in the vertical system would presumably produce analogous results. We can conclude that the large-field cells in the third visual
ganglion of the fly do not represent the x and y coordinates of the pattern motion vector.² This finding, however, does not rule out the possibility that the x and y components of pattern motion are computed at some later processing stage in the motion pathway of the fly. Therefore, we recorded the fly's behavioral turning responses about its vertical axis.³ These measurements were done on female flies of the species Musca domestica suspended from a torque-meter (Götz 1964) following standard procedures (Egelhaaf 1987). The signals of the computer simulations shown in Figure 3A were passed through a first-order low-pass filter with a time constant of 3 sec to account for the experimentally established low-pass filter between the third visual ganglion and the final motor output (Egelhaaf 1987). This leads to smoothing and phase shifting of the original signal (compare Fig. 4A with 3A). As was found for the H1-neuron, the behavioral responses induced by the semicircle moving either horizontally or vertically are almost perfectly mimicked by the computer simulations (Fig. 4). Again, pronounced responses are induced not only during horizontal motion but also during vertical motion. The latter responses show a quasisinusoidal modulation and, hence, the same sign reversal observed in the H1-response and in the simulations.

4 Conclusions
In principle it is possible to compute pattern velocity from the output of an array of correlation-type motion detectors (Reichardt et al. 1988; Reichardt and Schlögl 1988), provided that the second spatial derivatives of the pattern are also available and nonzero. Nevertheless, we find no evidence of such a computation in the fly; the output of its local motion detectors appears to undergo no more than a simple spatial integration and temporal filtering. Consequently, depending on the structure of the stimulus pattern, the direction of the optomotor response is not generally the same as the direction of pattern velocity. Since the function of the optomotor response is believed to be course stabilization, it might seem a deficiency for the response to be in the "wrong" direction. How can an organism such as the fly, which is able to perform fast and virtuosic visually guided flight maneuvers, afford to confound different directions in such a dramatic way?

²The response of the H1-neuron to the vertical motion of a full circle was also measured. Contrary to the predictions of our simple model, there was some small nonzero response. Refinements of the model can be introduced to account for such incomplete cancellation of response, such as unbalanced subtraction of the two subunits (Egelhaaf et al. 1989) and/or spatially inhomogeneous sensitivity (Hausen 1982).
³Unlike the H1 experiments, two copies of the stimulus were presented simultaneously, one to each eye of the fly. Since the optomotor response integrates signals from both eyes, the flicker response is thereby cancelled, leaving only the motion-selective response, which is of interest here. Duplication of the stimulus would have been irrelevant in the H1 experiments, since H1 receives almost exclusively monocular input.
Figure 4: (A) The integrated responses shown in Figure 3A of a two-dimensional array of correlation-type motion detectors, but fed through a first-order low-pass filter with a 3 sec time constant. Note that responses to horizontal and vertical motion are differently scaled. (B) Averaged optomotor turning responses (± SEM) obtained from 10 flies of the species Musca domestica, each tested 20 times for each stimulus condition. Clockwise turning tendencies are shown as positive signals, and counterclockwise as negative signals. The stimulus trace indicates the visible part of the circle at various instants in time. Stimulus conditions were the same as for the electrophysiological recordings (Figure 3B) except for the following: (1) Stimuli were presented on either side of the fly. (2) The square aperture had an extent of 60° as seen by the fly. (3) The aperture was centered at 45° horizontal position and 0° vertical position with respect to the fly.
A plausible answer is that under natural conditions, the fly does not confound directions as dramatically as it does with our artificial stimulus. For the great majority of ecologically relevant stimuli, it may be that the spatially integrated response is very close to the direction of motion. Recall that the symmetry of the full circle led to exact cancellation of the simulated transverse response. For a natural pattern, such exact cancellation is no doubt rare, but there may be an approximate cancellation due to statistical averaging over the complex shapes in the pattern.

Appendix: Continuum Approximation

Consider an image that consists of a circle of radius r₀ with luminance 1 surrounded by a background of luminance 0. This can be written in polar coordinates as
I(r, θ) = Θ(r₀ − r)
(A.1)
where Θ(x) is the Heaviside step function. Because of the spatial low-pass properties of the fly eye, the effective input to the detector array is a smoothed form of (A.1), which we can write as

I(r, θ) = f(r)

(A.2)
The precise form of f is not important in what follows. What is important is that the radius r₀ is much larger than the scale of the smoothing, so that f′(r) is negligible except for r ≈ r₀. This visual stimulus, moving at velocity v, is input to an array of orthogonally oriented detector pairs. The response of a pair is given by equation 2.1. Each detector has sampling base Δx and delay time Δt. In the continuum approximation to equation 2.1, the local response d(x, y) is related to the velocity vector v by an expression of the form (Reichardt 1987; Reichardt and Schlögl 1988)

d = m v      (A.3)
For a circular stimulus, the response matrix m is

m(r, θ) = a(r) ( cos²θ        sin θ cos θ )  +  b(r) (  sin²θ        −sin θ cos θ )
               ( sin θ cos θ  sin²θ       )          ( −sin θ cos θ   cos²θ       )      (A.4)
where a(r) = f′(r)² − f(r)f″(r), and b(r) = f(r)f′(r)/r. The off-diagonal element m_xy, the transverse response, is of special interest. It is the response of the x-detector to motion in the y-direction. The diagonal element m_xx is the longitudinal response, i.e., the response of the x-detector to motion in the x-direction. Assuming that the detector array is a square lattice of spacing Δx, the integrated output is

D = M v      (A.5)
where M is the integrated response over the portion of the circle that is visible. If the full circle is visible, the off-diagonal elements of the integrated response vanish, so that

M = α ( 1  0 )
      ( 0  1 )      (A.6)
where

α ≡ π ∫₀^∞ dr r a(r)      (A.7)
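The identity form of A.6 and the prefactor A.7 follow by integrating over the full circle in polar coordinates. This is our filled-in intermediate step, under the assumption that the a(r) term of A.4 is the radial projector a(r) r̂ r̂ᵀ, with diagonal entries cos²θ and sin²θ and off-diagonal entries sin θ cos θ:

```latex
\int_0^{2\pi} \cos^2\theta \, d\theta
  = \int_0^{2\pi} \sin^2\theta \, d\theta = \pi ,
\qquad
\int_0^{2\pi} \sin\theta \cos\theta \, d\theta = 0 ,
```

so that, dropping the (much smaller) $b(r)$ contribution,

```latex
M = \int_0^{\infty} r \, dr \int_0^{2\pi} d\theta \; m(r,\theta)
  \approx \pi \int_0^{\infty} dr \, r \, a(r)
    \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
  = \alpha \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} ,
```

in agreement with A.6 and A.7.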
The integral of the b(r) term in A.4 has been neglected, since it is much smaller than the a(r) term. Because M is proportional to the identity matrix, the integrated response vector D is in the same direction as the stimulus velocity v. For the stimulus used in these experiments, a semicircle moving behind a square aperture (shown in Fig. 2), the integrated response matrix has components

M_xx(γ) = α cos⁻¹(1 − 2γ)      (A.8)
M_xy(γ) = α [1 − (1 − 2γ)²]      (A.9)
where γ ∝ t is the fraction of the semicircle that is visible in the aperture. These formulas are valid for the first half of the stimulus period, when γ is increasing from 0 to 1. The formulas for the second half of the period are derived similarly. The full response curves are shown in Figure 2.

Acknowledgments

We are grateful to K. G. Götz, W. Reichardt, and J. M. Zanker for carefully reading the manuscript. We also thank the people from the summer 1990 Woods Hole course "Neural Systems and Behavior," where this work was started, for the supportive and stimulating atmosphere. We especially thank B. Mensh and C. Gilbert for assistance in the early stages of this investigation.

References

Adelson, E. H., and Movshon, J. A. 1982. Phenomenal coherence of moving visual patterns. Nature (London) 300, 523-525.
Borst, A., and Egelhaaf, M. 1989. Principles of visual motion detection. Trends Neurosci. 12, 297-306.
Borst, A., and Egelhaaf, M. 1990. Direction selectivity of fly motion-sensitive neurons is computed in a two-stage process. Proc. Natl. Acad. Sci. U.S.A. 87, 9363-9367.
Buchner, E. 1976. Elementary movement detectors in an insect visual system. Biol. Cybern. 24, 85-101.
Buchner, E., Götz, K. G., and Straub, C. 1978. Elementary detectors for vertical movement in the visual system of Drosophila. Biol. Cybern. 31, 235-242.
Egelhaaf, M. 1987. Dynamic properties of two control systems underlying visually guided turning in house-flies. J. Comp. Physiol. A161, 777-783.
Egelhaaf, M., and Borst, A. 1993. Motion computation and visual orientation in flies. Comp. Physiol. Biochem. (in press).
Egelhaaf, M., Hausen, K., Reichardt, W., and Wehrhahn, C. 1988. Visual course control in flies relies on neuronal computation of object and background motion. Trends Neurosci. 11, 351-358.
Egelhaaf, M., Borst, A., and Reichardt, W. 1989. Computational structure of a biological motion-detection system as revealed by local detector analysis in the fly's nervous system. J. Opt. Soc. Am. A6, 1070-1087.
Fermi, G., and Reichardt, W. 1963. Optomotorische Reaktionen der Fliege Musca domestica. Abhängigkeit der Reaktion von der Wellenlänge, der Geschwindigkeit, dem Kontrast und der mittleren Leuchtdichte bewegter periodischer Muster. Kybernetik 2, 15-28.
Ferrera, V. P., and Wilson, H. R. 1990. Perceived direction of moving two-dimensional patterns. Vision Res. 30, 273-287.
Götz, K. G. 1964. Optomotorische Untersuchungen des visuellen Systems einiger Augenmutanten der Fruchtfliege Drosophila. Kybernetik 2, 77-92.
Götz, K. G. 1972. Principles of optomotor reactions in insects. Bibl. Ophthal. 82, 251-259.
Hassenstein, B., and Reichardt, W. 1956. Systemtheoretische Analyse der Zeit-, Reihenfolgen- und Vorzeichenauswertung bei der Bewegungsperzeption des Rüsselkäfers Chlorophanus. Z. Naturforsch. 11b, 513-524.
Hausen, K. 1976. Functional characterization and anatomical identification of motion sensitive neurons in the lobula plate of the blowfly Calliphora erythrocephala. Z. Naturforsch. 31c, 629-633.
Hausen, K. 1982. Motion sensitive interneurons in the optomotor system of the fly. I. The horizontal cells: Structure and signals. Biol. Cybern. 45, 143-156.
Hausen, K., and Egelhaaf, M. 1989. Neural mechanisms of visual course control in insects. In Facets of Vision, D. G. Stavenga and R. C. Hardie, eds., Chap. 18, pp. 391-424. Springer-Verlag, Berlin.
Hengstenberg, R. 1982. Common visual response properties of giant vertical cells in the lobula plate of the blowfly Calliphora. J. Comp. Physiol. A149, 179-193.
Hildreth, E. C., and Koch, C. 1987. The analysis of visual motion: From computational theory to neuronal mechanisms. Annu. Rev. Neurosci. 10, 477-533.
Horn, B. K. P., and Schunck, B. G. 1981. Determining optical flow. Artif. Intell. 17, 185-203.
McCann, G. D., and Dill, J. C. 1969. Fundamental properties of intensity, form, and motion perception in the visual nervous system of Calliphora phaenicia and Musca domestica. J. Gen. Physiol. 53, 385-413.
Movshon, J. A., Adelson, E. H., Gizzi, M. S., and Newsome, W. T. 1986. The analysis of moving visual patterns. Exp. Brain Res. 11, 117-152.
Reichardt, W. E. 1987. Evaluation of optical motion information by movement detectors. J. Comp. Physiol. A161, 533-547.
Reichardt, W., Egelhaaf, M., and Schlögl, R. W. 1988. Movement detectors provide sufficient information for local computation of 2-d velocity field. Naturwissenschaften 75, 313-315.
Reichardt, W. 1961. Autocorrelation, a principle for the evaluation of sensory information by the central nervous system. In Sensory Communication, W. A. Rosenblith, ed., pp. 303-317. MIT Press and J. Wiley, New York.
Reichardt, W., and Guo, A.-K. 1986. Elementary pattern discrimination (behavioural experiments with the fly Musca domestica). Biol. Cybern. 53, 285-306.
Reichardt, W. E., and Schlögl, R. W. 1988. A two-dimensional field theory for motion computation. Biol. Cybern. 60, 23-35.
Srinivasan, M. V. 1990. Generalized gradient schemes for the measurement of two-dimensional image motion. Biol. Cybern. 63, 421-431.
Stoner, G. R., Albright, T. D., and Ramachandran, V. S. 1990. Transparency and coherence in human motion perception. Nature (London) 344, 153-155.
Stone, L. S., Watson, A. B., and Mulligan, J. B. 1990. Effect of contrast on the perceived direction of a moving plaid. Vision Res. 30, 1049-1067.
Ullman, S. 1983. The measurement of visual motion. Trends Neurosci. 6, 177-179.
Uras, S., Girosi, F., Verri, A., and Torre, V. 1988. A computational approach to motion perception. Biol. Cybern. 60, 79-87.
Verri, A., and Poggio, T. 1989. Motion field and optical flow: Qualitative properties. IEEE Trans. PAMI 11, 490-498.

Received 28 August 1992; accepted 10 March 1993.
Communicated by Bruce McNaughton
Neural Representation of Space Using Sinusoidal Arrays

David S. Touretzky
A. David Redish
Hank S. Wan
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213 USA

O'Keefe (1991) has proposed that spatial information in rats might be represented as phasors: phase and amplitude of a sine wave encoding angle and distance to a landmark. We describe computer simulations showing that operations on phasors can be efficiently realized by arrays of spiking neurons that recode the temporal dimension of the sine wave spatially. Some cells in motor and parietal cortex exhibit response properties compatible with this proposal.

1 Introduction
Any vector in polar coordinates v = (r, φ) can be represented as a sine wave f(t) = r cos(ωt + φ), where r is amplitude, φ is phase, and ω is (constant) frequency. This is commonly known as a phasor. The advantage of the phasor representation is that translation and rotation of a vector are both trivial operations. Translation is achieved by addition of sine waves, and rotation can be obtained by phase shifting or temporal delay. O'Keefe (1991) suggested that rats might use phasors to encode angle and distance to landmarks. In his proposal, hippocampal theta provides the reference signal for determining phase. This temporal approach to encoding a sine wave has some drawbacks. The 7-12 Hz theta rhythm may be too slow to support real-time spatial reasoning tasks requiring rapid manipulation of phasors. Furthermore, maintaining even a modest angular resolution of 10° relative to a roughly 10 Hz reference signal requires a temporal resolution of 3 msec. Although some specialized sensory systems are known to make much finer discriminations (e.g., acoustic imaging in bats and dolphins, or auditory localization in barn owls), we are reluctant to require this degree of temporal precision at the higher cognitive level associated with spatial reasoning. Instead, we suggest that phasor operations are more plausibly realized by recoding the temporal dimension of the sine wave spatially, using populations of spiking neurons. We propose an architecture called the sinusoidal array for manipulating vectors in phasor form, and report the results of computer simulations.

Neural Computation 5, 869-884 (1993) © 1993 Massachusetts Institute of Technology
There is some experimental evidence that sinusoidal array representations may exist in rat parietal cortex and in rhesus motor or parietal cortex. We propose an experiment to test this hypothesis in rats.

2 Sinusoidal Arrays
To encode a phasor as a sinusoidal array, we replace the continuous temporal signal f(t) by a distributed pattern of activity over an array of N elements, as in Figure 1. The value encoded by the ith array element is the amplitude of the sine wave sampled at point 2πi/N. That is, the activity level of the ith array element encoding the vector (r, φ) is given by f(r, φ, i) = r cos(φ + 2πi/N), for 0 ≤ i < N. Note that for the special case of N = 4, the sinusoidal array encoding is exactly the Cartesian encoding (x, y, −x, −y), where x = r cos φ and y = r sin φ. Each sinusoidal array element is a collection of neurons. Its activity level is encoded by the neurons' average firing rate, or equivalently, the average percentage of neurons firing at any instant. If the neuronal population is sufficiently large, this representation can encode values with high precision even when individual neurons are noisy and have a limited number of discriminable firing rates. In order to be able to represent the negative half of the sine wave, neurons in a sinusoidal array fire at a rate F(r, φ, i) = k·f(r, φ, i) + b, where k is a gain parameter and b the baseline firing rate. In our simulations, the baseline firing rate is 40 spikes/sec. This gives the neuron a dynamic range of 0-80 Hz, which is compatible with cells in parietal cortex. A significant advantage of the sinusoidal array representation is that it allows coordinate transforms to be done nearly instantaneously. If the signal f(t) were represented temporally, the simplest way to determine its
Figure 1: The phasor (r, φ) and its sinusoidal array representation.
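The encoding above can be sketched in a few lines (our illustration, not the authors' simulation code); the decoding helper, which projects the activity pattern onto its fundamental spatial frequency, is an added assumption used only to check the round trip:

```python
import numpy as np

N = 24                     # number of array elements, as in the simulations
i = np.arange(N)

def encode(r, phi):
    """Sinusoidal-array activity f(r, phi, i) = r cos(phi + 2*pi*i/N)."""
    return r * np.cos(phi + 2 * np.pi * i / N)

def decode(f):
    """Read the phasor back out as the complex number r*exp(1j*phi).

    Projection onto the fundamental frequency; the second-harmonic term of
    the cosine sums to zero over the N equally spaced sample points.
    """
    return (2 / N) * np.sum(f * np.exp(-1j * 2 * np.pi * i / N))
```

The continuous projection recovers (r, φ) exactly from noiseless activities; a downstream 1-of-N readout would instead quantize the phase to ±π/N.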
phase would be to wait for the peak. But one might have to wait up to one full period, which would be 140 msec for a 7 Hz signal. Alternatively, the phase could be determined by estimating the slope f′(t) and taking the arc cosine, but this solution seems less neurally plausible than the spatial encoding investigated here. A drawback of the sinusoidal array representation is that angular resolution is limited to 2π/N. But even modest values of N appear to give sufficient resolution for navigation tasks. In our simulations we chose N = 24, giving an angular resolution of ±7.5°.

3 Vector Operations with Sinusoidal Arrays
In order to successfully complete a landmark-based navigation task, an animal must perform some coordinate transformations to derive a goal location from observed landmark positions. These transformations include at least translation, probably negation, and perhaps also rotation. In a phasor-based coordinate system, translation of a vector is accomplished by adding the corresponding sine waves, e.g., f(t) = f₁(t) + f₂(t). In the sinusoidal array representation, translation is accomplished by element-wise linear addition of firing rates: F(i) = F₁(i) + F₂(i) − b, for 0 ≤ i < N. The subtraction of one baseline value b normalizes the result; it can be accomplished by giving the summation neuron an inhibitory bias term equal to the baseline firing rate. Negation of a vector can be accomplished in a variety of ways. Given a maximum activity level M for an element, we can compute F(i) = M − F₁(i) for 0 ≤ i < N by using inhibitory connections to units whose baseline firing rate is M. However, since negation of a vector is usually required only as part of a vector subtraction operation, the easiest solution may be to use the addition mechanism described in the previous paragraph, but with one of the vectors rotated by 180°. This gives F(i) = F₁(i) + F₂((i + N/2) mod N) − b. If translation and negation were the only required operations, there would be no advantage to using phasors. All computations could be done in Cartesian coordinates, using any neural encoding that correctly maintained the independent values x and y. However, when rotation is introduced, x and y are no longer independent. And since rotation in a Cartesian system is a nonmonotonic function, it is not easily computed with neuron-like units. (Of course, rotation is linear in a polar coordinate system, but then translation becomes nonmonotonic.) A point f₁(t) in phasor form can be rotated by α radians about the origin by simply computing f(t) = f₁(t + α/ω).
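The translation and subtraction rules just given can be checked numerically against ordinary complex (phasor) arithmetic. This is our sketch, not the authors' code, and the decoding projection is an added assumption used only for the check:

```python
import numpy as np

N, k, b = 24, 1.0, 40.0    # array size, gain, baseline rate (spikes/sec)
i = np.arange(N)

def rates(r, phi):
    """Firing rates F(r, phi, i) = k*f(r, phi, i) + b across the array."""
    return k * r * np.cos(phi + 2 * np.pi * i / N) + b

def translate(F1, F2):
    """Vector addition: F(i) = F1(i) + F2(i) - b (one baseline subtracted)."""
    return F1 + F2 - b

def subtract(F1, F2):
    """Vector subtraction: add F2 rotated by 180 degrees (N/2 elements)."""
    return F1 + np.roll(F2, -(N // 2)) - b

def decode(F):
    """Recover the encoded vector as the complex number r*exp(1j*phi)."""
    f = (F - b) / k
    return (2 / N) * np.sum(f * np.exp(-1j * 2 * np.pi * i / N))
```

Rates remain positive here because k·r < b for the magnitudes used; with the paper's 40 spikes/sec baseline and k = 1, the representable radius is 40.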
We implement the equivalent operation in sinusoidal arrays by rotating the array, using shifter circuitry similar to that proposed in Anderson and Van Essen (1987) and Olshausen et al. (1992). The shifter takes two vectors as input: one is the signal to be shifted, while the other specifies the angle of rotation. (The
Figure 2: Schematic diagram of the shifter circuit. The signal entering at right goes through a contrast enhancement stage and winner-take-all phase detector, which determines the amount by which the input sine wave (top left) should be shifted. Light-colored lines indicate lateral inhibition connections. Only a subset of the permutation channel connections is shown.

amplitude of the latter sine wave is ignored.) The shifter itself has two components: a 1-of-N phase detector and a set of N gated permutation channels, as shown in Figure 2. The 1-of-N phase detector contains one neuron for each of the N sinusoidal array elements. These neurons integrate their inputs over a brief time interval; the one receiving the largest input reaches threshold and fires first. We think of these phase detector neurons as similar to fast-spike inhibitory interneurons. They have small refractory times and two sorts of postsynaptic effect: a short timescale inhibition of other phase detector cells (whose recovery from inhibition initiates a new winner-take-all round), and a long timescale inhibition that acts as a gating signal for channel-inhibitory neurons in the second half of the shifter. The shifter's N gated permutation channels each copy the activity of the N-element input array to the N-element output array, permuting the elements along the way. When the jth channel is active, it copies the activation of input element i to output element (i − j) mod N, for 0 ≤ i < N. The channels have associated with them tonically active channel-inhibitory neurons that keep them silent most of the time. These are the same type of inhibitory units as the phase detector neurons, except that their only inputs are inhibitory. When a neuron in the phase detector fires, it
inhibits the corresponding channel-inhibitory neuron, thereby disinhibiting the channel and allowing the shifted sine wave to appear across the output array. Anderson and Van Essen (1987) describe a shifter using log₂ N levels where each level has two permutation channels, giving O(N) connections. In a refinement of this model, Olshausen et al. (1992) use four levels with varying numbers of nodes, and fan-ins of approximately 1000, mirroring the connectivity of cortical areas V1, V2, V4, and IT. Because our own N is so small (N = 24 in the simulations), we can use a single level with N channels and O(N²) connections. Aside from the obvious advantage of simplicity of connection structure, this allows us to use a simple 1-of-N representation for the amount by which to shift, rather than the more complex binary encoding required by Anderson and Van Essen, or the distributed encoding of Olshausen et al. Our model is also simpler because it requires only shunting inhibition, whereas theirs requires multiplicative synapses. The shifter circuit is not central to our theory. As discussed in the next section, many rodent navigation tasks can be performed using just translation. However, in situations where the reference frame must be determined anew on each trial based on the orientation of a cue array, there does appear to be a need for mental rotation of some sort. The shifter offers a solution to the general problem of rotation of vectors. But for some navigation tasks, the animal could instead slew its internal compass.
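Functionally, the shifter reduces to a winner-take-all over the control array followed by one of N cyclic permutations of the signal array. The sketch below is ours, not the original simulation; in particular, the permutation direction is chosen so that the control phase adds under the cos(φ + 2πi/N) encoding, which may differ from the paper's indexing convention by a sign:

```python
import numpy as np

N = 24
i = np.arange(N)

def array_of(r, phi):
    """Sinusoidal-array activity for the phasor (r, phi), baseline omitted."""
    return r * np.cos(phi + 2 * np.pi * i / N)

def shift(signal, control):
    """Rotate `signal` by the phase encoded in `control`.

    The 1-of-N phase detector is modeled as an argmax (the winner-take-all
    outcome); firing of winner j disinhibits the jth permutation channel,
    modeled here as a cyclic shift of the input array by j elements.
    """
    j = int(np.argmax(control))      # winner-take-all phase detection
    return np.roll(signal, j)        # gated permutation channel j
```

Angles that fall on the 2π/N lattice shift exactly; arbitrary angles are quantized to the nearest 2π/N, i.e., ±7.5° for N = 24.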
4 Rodent Navigation
In a remarkable series of experiments, Collett, Cartwright, and Smith investigated landmark learning behavior in gerbils (Collett et al. 1986). We will describe two of their simpler experiments here. Figure 3 shows the result of training a gerbil to find a food reward at a constant distance (50 cm) and compass bearing from a cylindrical landmark. The landmark was moved to a different location on each trial. Once trained (requiring on the order of 150 trials), the gerbil proceeded directly to the goal location from any starting point, and spent most of its time searching in the goal location. To model this behavior, we assume that the gerbil has learned a constant memory vector M that describes the remembered angle and distance of the landmark from the goal. On each trial, the gerbil's perceptual apparatus produces a vector P that describes the location of the landmark relative to the animal's current position. Thus, the position of the goal relative to the animal can be computed by vector subtraction: G = P − M. Collett et al. (1986) show that the animal must be computing this location, rather than simply moving to make its view of the landmark match a
David S. Touretzky, A. David Redish, and Hank S. Wan
Figure 3: Learning to find food at a constant distance and bearing from a landmark, after Collett et al. (1986) and Leonard and McNaughton (1990). S marks sample starting locations; F is the food location; the solid circle is the landmark; concentric circles show the distribution of search time when trying to locate the food reward. The majority of search time is spent at the goal location.

stored memory of the goal, by turning off the lights after it had begun moving toward the goal. The animal still proceeded directly to the goal.

The calculation of the goal location relies on a critical assumption: that the memory vector M, the perceptual vector P, and the goal vector G share the same reference direction, which we call global north. In Collett et al. (1986) this commonality was attributed to "unspecified directional cues." Recently it has been shown that rodents possess a highly accurate internal compass, which allows them to judge head direction even in the absence of visual cues (Chen et al. 1990; Chen 1991; Taube et al. 1990a,b). The compass is not related to magnetic north; it is maintained in part by integrating angular accelerations over time (Mittelstaedt and Mittelstaedt 1980; Etienne 1987; Markus et al. 1990). McNaughton (personal communication) has observed that rats taken from their home cages will maintain their compass as they are carried into another room for experiments, so they have a consistent reference frame available even if the experimental environment is poor in directional cues.¹ This is significant because an internal compass that provides a stable north across all learning trials allows many simple navigation tasks to be performed without resorting to mental rotation.

In our simulation of the Collett et al. task, we assume that the perceptual and memory systems orient their respective vectors, P and M, using the same internal compass. The sinusoidal array then computes G by vector subtraction.

¹However, the compass can be confused if the box used to transport the animal is spun through several revolutions.

Figure 4: Learning to find food at a constant distance and bearing relative to a rotating cue array, after Collett et al. (1986).

Figure 5: Bearing α to the food reward is measured with respect to the line joining landmarks L1 and L2, not global north.

Figure 4 shows a more demanding experiment in which the cue array is rotated as well as translated on each trial. The bearing of the food reward is constant with respect to the line joining landmarks L1 and L2, as shown in Figure 5, but not with respect to the more salient cue provided by the internal compass. Tasks of this sort, in which bearings must be measured with respect to the cue array, should be more difficult to learn (Collett et al.'s observations support this), and would seem to require mental rotation.
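The basic trajectory computation G = P − M can be sketched with complex phasors standing in for sinusoidal arrays. The numbers below are illustrative, and the encoding function is a stand-in, not the neural implementation.

```python
import cmath
import math

def phasor(distance, bearing_deg):
    """Encode a (distance, bearing) pair as a complex phasor; a stand-in
    for the sinusoidal array representation of a spatial vector."""
    return cmath.rect(distance, math.radians(bearing_deg))

# Landmark perceived 30 cm away at bearing 90 degrees from the animal (P),
# and remembered as 50 cm at bearing 0 degrees from the goal (M).
P = phasor(30.0, 90.0)
M = phasor(50.0, 0.0)

G = P - M  # goal location relative to the animal, in the shared compass frame
goal_distance = abs(G)
goal_bearing = math.degrees(cmath.phase(G))
```

The subtraction only makes sense because P and M are expressed against the same reference direction, which is exactly the "global north" assumption discussed in the text.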
Figure 6: Computations involved in solving the task shown in Figure 4. P1, P2: coordinates of landmarks in the perceptual reference frame; M1, M2: remembered coordinates of landmarks as seen from the food location; R: rotational alignment factor; G: computed goal location in the perceptual reference frame.

Here is one way to solve the task in Figure 4. The line joining the two landmarks defines a "local north" consistent across trials for the reference frame in which the memory vectors M1 and M2 are expressed. The perceptual vectors P1 and P2 are expressed relative to the animal's internal compass, which is not aligned with this reference frame. If local north did coincide with the internal compass on some particular trial, then M1 − M2 would equal P1 − P2, and the goal vector G would be simply P2 − M2. (It would also be equal to P1 − M1, but the closer landmark is likely to give a more accurate solution.) In general, though, we will have to bring the two reference frames into alignment before locating the goal. The vector from the second to the first landmark, M1 − M2 in the memory frame, should correspond to the vector P1 − P2 in the perceptual frame. The required rotational alignment factor is therefore Phase(M1 − M2) − Phase(P1 − P2). Let rot(v, w) denote the antirotation of vector v by the phase of w. In other words, let rot(v, w) have the same magnitude as v, but phase equal to Phase(v) − Phase(w). Then the rotational alignment factor we require is equal to the phase of R = rot(M1 − M2, P1 − P2), and the goal location is given by G = P2 − rot(M2, R). Each of the two rotation operations can be computed by the shifter described earlier. Our computer simulation of this task involves three vector subtractions and two rotations, as shown in Figure 6. We are not suggesting that rodent brains are wired to perform this specific computation; it seems more likely that some general spatial reasoning mechanism is involved.
But the mechanism's primitive operations are likely to include translation and rotation. Our simulation shows that a combination of five of these operations is sufficient for performing the task in Figure 4. This number
can be reduced to four if the vector M1 − M2 is remembered rather than computed on the fly. Further simplifications are possible. Instead of aligning memory with perception to compute P2 − rot(M2, R), the animal could slew its internal compass by the phase of R, realigning its perception to match memory. Then it need only compute P2′ − M2, where P2′ is the new perceptual vector measured with the slewed compass. McNaughton (personal communication) reports that rats do in fact realign the preferred directions of their head direction cells when the visual world is rotated at a perceptible rate, provided that the environment is familiar to them. Slewing the compass keeps landmarks at their learned headings. In unfamiliar environments the animal does not respond to rotation this way; it maintains its compass using inertial cues, as it does in the dark.
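The two-rotation computation of Figure 6 can be sketched with the same complex-phasor stand-in for sinusoidal arrays; rot(v, w) is exactly the antirotation defined in the text, and the example coordinates are illustrative.

```python
import cmath

def rot(v, w):
    """Antirotate phasor v by the phase of w: same magnitude as v,
    phase equal to Phase(v) - Phase(w)."""
    return cmath.rect(abs(v), cmath.phase(v) - cmath.phase(w))

def goal_from_landmarks(P1, P2, M1, M2):
    """Three vector subtractions and two rotations, as in Figure 6:
    R = rot(M1 - M2, P1 - P2) aligns the memory frame with perception,
    and G = P2 - rot(M2, R) is the goal in the perceptual frame."""
    R = rot(M1 - M2, P1 - P2)  # rotational alignment factor
    return P2 - rot(M2, R)

# When the two frames already agree (R has phase 0), this reduces to P2 - M2:
P1, P2 = 1 + 2j, 3 + 1j
M1, M2 = -1 + 3j, 1 + 2j
# Here M1 - M2 == P1 - P2 == -2 + 1j, so the goal is P2 - M2 == 2 - 1j.
goal = goal_from_landmarks(P1, P2, M1, M2)
```

Rotating the cue array rotates P1 and P2 by a common phase; R absorbs that phase, so the computed goal stays correct, which is the point of the alignment step.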
5 Details of the Computer Simulations
Our computer simulations are based on abstract neuron models that are simpler than compartmental models, but retain many important properties of real neurons, such as spiking behavior. The simulation uses two types of neurons: pyramidal cells and inhibitory interneurons. Sinusoidal arrays contain 24 elements with 100 pyramidal cells each.

Our abstract pyramidal cell has a resting potential of 0, a threshold θ = 1, and a typical fan-in of 20 (but as high as 240 in the shifter) with synaptic weights of 0.1. It sums its inputs over time, and when it reaches threshold, it fires a spike. Spiking is treated as an instantaneous event, i.e., it lasts for one clock tick, after which the cell zeros its net activation and enters a refractory state. For the experiments reported here, a clock tick, Δt, is 0.1 msec. The cell's refractory period is 1/80 sec, limiting the peak firing rate to 80 Hz. It is important that the clock rate be significantly faster than the peak firing rate, so that inputs are not lost when a cell zeros its net activation. Only impulses arriving at the exact moment a cell spikes will be lost; during the refractory period the cell continues to integrate its inputs.

Pyramidal cells make up the summation module used for addition and subtraction of phasors. Cells in the summation module receive excitatory inputs from two sinusoidal arrays, following the equation F(i) = F1(i) + F2(i) − b. A neuron in the ith array element will receive inputs from 10 randomly chosen neurons from the ith element of each input array. The bias term b = 40 Hz is implemented by decrementing the net activation by b · θ · Δt at every clock tick, but the total activation of the cell is not permitted to go below 0. Pyramidal cells also make up the output array of the shifter module. These cells have a fan-in of 240, since they receive 10 inputs from each of the N permutation channels. They do not require a bias term.
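A minimal sketch of this abstract cell follows, using the values given in the text (θ = 1, Δt = 0.1 msec, refractory period 1/80 sec); the class and parameter names are ours.

```python
class PyramidalCell:
    """Abstract spiking cell: integrates weighted input each 0.1-msec tick,
    fires when activation reaches threshold, then zeros its activation and
    enters a refractory state during which it continues to integrate."""

    def __init__(self, threshold=1.0, dt=1e-4, refractory=1.0 / 80):
        self.threshold = threshold
        self.refractory_ticks = int(round(refractory / dt))
        self.v = 0.0
        self.refractory_left = 0

    def step(self, weighted_input):
        """Advance one clock tick; return True if the cell spikes."""
        self.v += weighted_input  # inputs are not lost during refraction
        if self.refractory_left > 0:
            self.refractory_left -= 1
            return False
        if self.v >= self.threshold:
            self.v = 0.0  # spike: zero net activation, go refractory
            self.refractory_left = self.refractory_ticks
            return True
        return False

# A constant drive of 0.1 per tick (one synaptic weight) would reach
# threshold every 10 ticks, but the 1/80-sec refractory period caps the
# firing rate at roughly 80 Hz.
cell = PyramidalCell()
spikes = sum(cell.step(0.1) for _ in range(10000))  # 1 second of ticks
```

Because activation keeps accumulating through the refractory period, a strongly driven cell fires again immediately upon recovery, pinning it at the peak rate just as described above.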
The second type of model neuron is a fast-spike inhibitory interneuron used in the shifter. Both the phase detector neurons and the permutation channel inhibitory neurons are of this type. It has a resting level of 0 and a threshold of 1, like the pyramidal cell, but the refractory period is only 5 msec. The firing of a phase detector neuron has two distinct effects. First, it inhibits all the other phase detector neurons, essentially setting their net activation to zero. Second, it inhibits the corresponding channel-inhibitory neuron, allowing the permutation channel to open. Lateral inhibition of phase detector cells should have a short time course, so that when a neuron loses the race to fire first it can reenter the competition in short order. But channel-inhibitory neurons should be inhibited for a relatively long time, because we do not want the channel to close again between successive firings of its controlling phase detector. In cortex, GABA-A inhibition has a short time course, while GABA-B inhibition has a long time course. It therefore does not seem unreasonable to posit different inhibitory effects arising from the same interneuron.

The channel-inhibitory neurons, when not themselves inhibited, shut down the permutation channel. This could be accomplished in real neural systems in several ways. If we assume that the ith channel's bundle of connections from input cells to a cell in the shifter's output array is distributed throughout the output cell's dendritic tree, then shutting down the channel would require inhibitory axoaxonic synapses at many select sites. But if the connections comprising the ith channel were localized to a specific region of the output cell's dendritic tree, the channel-inhibitory interneuron would require only a single synapse onto the base of this subtree. Because our simulation is not at the compartmental level, we do not distinguish between these possibilities in our model.
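The 1-of-N race among phase detector cells can be sketched as follows. This abstracts away spike timing: each detector integrates a constant drive per tick, and the first to cross threshold wins (within-tick ties are resolved in favor of the largest activation, an assumption of ours).

```python
def winner_take_all(drives, threshold=1.0):
    """1-of-N phase detector race: each cell integrates its constant drive
    once per tick; the first to reach threshold fires and (via lateral
    inhibition) silences the rest. Returns the winning cell's index."""
    v = [0.0] * len(drives)
    while True:
        v = [a + d for a, d in zip(v, drives)]
        over = [i for i, a in enumerate(v) if a >= threshold]
        if over:
            # the winner disinhibits its permutation channel downstream
            return max(over, key=lambda i: v[i])

# The array element receiving the largest input fires first:
winner = winner_take_all([0.2, 0.5, 0.35])
```

In the full circuit the winner's long-timescale inhibition then holds the corresponding channel-inhibitory neuron silent, keeping one permutation channel open between rounds.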
We add noise to the model by perturbing each cell's activation level by a small random value at each clock tick. For 5% noise, we use perturbations in the range ±0.025 · M · θ · Δt, where M is the cell's maximum firing rate. Small amounts of noise actually improve the shifter's performance by preventing the output cells within an array element from synchronizing with each other due to inhibition from contrast enhancement, described below.² Noise also prevents a phase detector cell from consistently winning the race to inhibit its neighbors just because the cells that synapse onto it happened to start out with a slightly higher initial activation level.

A technical problem with the shifter suggests that we may want to add basket cells to our model. We found that for the shifter to work correctly, the phase detector must produce a stable output, i.e., report a consistent phase. However, when the sine wave input to the phase detector is of small amplitude, the peak can be difficult to determine precisely, so the phase detector's output wanders among several

²Synchronization would cause anomalous behavior in any phase detector that used this signal as input, unless the phase detector cells integrated their inputs over a much longer time period.
nearby values. This results in the opening of different permutation channels at different times, degrading the shifter's output representation. To prevent this, we introduced a contrast enhancement layer with a form of "center-surround" feedback inhibition to preprocess the phase detector's input and make the peak easier to find. In real neural systems, this type of inhibitory feedback is thought to be provided by basket cells (McNaughton and Nadel 1990). The details of our model's contrast enhancement mechanism are a bit ad hoc at present and in need of refinement, but preliminary results show that it does produce correct and stable phase detector output. If the inhibitory feedback is set at a high level, the contrast enhancement process yields an array representation with only one active element, thereby anticipating the winner-take-all function of the phase detector. However, for a range of lower values, instead of winner-take-all behavior, contrast enhancement produces cells with triangular response functions. The firing rates of these cells peak at a certain preferred direction, fall off roughly linearly within 30-60 degrees of that direction, and are elsewhere flat and close to zero. As discussed in the next section, cells with this behavior have been found in postsubiculum by Taube et al.

We have also run simulations varying the number of neurons in a sinusoidal array. There was no appreciable advantage to doubling the number to 200 neurons per element. There was a slight penalty for using only 50 neurons: it took longer for the shifter to settle down and produce a consistent output signal, because contrast enhancement had to be done more slowly to avoid errors. With 20 neurons per element the system was unstable.

6 Experimental Evidence for Sinusoidal Arrays
A necessary condition for sinusoidal arrays to exist in cortex is the presence of cells whose response pattern obeys the function F(r, φ) = b + k r cos φ, where distance r and angle φ are measured either egocentrically or allocentrically. Georgopoulos et al. have formulated a similar equation, d(M) = b + k cos(θ_CM), to describe the behavior of neurons in rhesus parietal cortex (Kalaska et al. 1983) and motor cortex. These neurons have firing rates proportional to the cosine of the angle between a "preferred direction vector" C and an intended reaching vector M. Different cells have different directional preferences³ and hence different firing rates for a given movement. Their collective activity forms a "neural population vector" that can express any angle of intended motion. Another important piece of evidence in support of the sinusoidal array hypothesis is the finding in rats of cells that encode head direction with
³The preferred direction C plays the role of the array position i in our formula for F(r, φᵢ).
Figure 7: Tuning curves for a cell in parietal area Oc2M when the animal is motionless or making a left or right turn. Modified from Chen (1991, p. 118).
respect to either a visual landmark or an inertial reference frame. These cells appear to be part of the animal's internal compass referred to earlier. Taube et al. (1990a) report head-direction sensitive cells in postsubiculum with sharp directional preferences that are independent of the animal's location in the environment. When a prominent landmark is shifted along the wall of a cylindrical chamber, the cells' directional tuning curves shift by a comparable amount, indicating that the animal is using visual cues to maintain the compass (Taube et al. 1990b). The cells Taube et al. describe have triangular tuning curves, not sinusoidal ones. But Chen et al. (1990), recording from parietal/retrosplenial association cortex, also found head-direction sensitive cells, and in some cases the response pattern was a cosine function. Figure 7 shows the tuning curve of one such cell described in Chen (1991).

The crucial question for both the rat and primate data is whether cells with a sinusoidal response function are also sensitive to distance. Schwartz and Georgopoulos (1987) have found this to be the case in rhesus motor cortex. They first varied the angle of a constant-distance target in a reaching task, to determine the preferred direction for each cell. Subsequently, they varied the distance between the animal and the target when the target was located at the cell's preferred direction. They report a substantial number of direction-sensitive cells with a weak but statistically significant linear response as a function of target distance. In the case of the rat parietal recordings, to measure sensitivity to distance the animal would have to be attending to some known location. One way to accomplish this would be to train the rat to perform a landmark-based navigation task as in Figure 3, and then look
for direction-sensitive parietal cells whose response varied linearly with distance to either the landmark or the goal.

7 Discussion
Hippocampal theta may play some role as a reference signal for navigation, but it is probably not related to compass direction. O'Keefe and Recce (1992) report that the phase at which place cells fire relative to the theta rhythm varies through 360° as the animal enters, proceeds through, and exits the cell's place field. This has led Burgess, O'Keefe, and Recce to propose a navigation model in which phase information is used to distinguish entering vs. exiting. In conjunction with head direction information and a separate layer of goal cells, the net firing field of subicular place cells at phase 0° is peaked ahead of the rat, allowing the animal to navigate by homing to a goal location (Burgess et al. 1993). The Burgess et al. model has a number of interesting properties, but it cannot deal with complex navigation tasks of the sort Collett et al. have studied, with cue arrays that change position and orientation from trial to trial. While the hippocampus is known to play an important role in spatial behavior, researchers such as Nadel (1991) claim that its role is spatial memory, not planning and navigation. Parietal cortex appears to be involved in these latter tasks (Stein 1991).

McNaughton et al. (1991) propose a model of directional sense based on both vestibular sensations and visual cues. In darkness or unfamiliar environments, the animal maintains its compass by inertial means, using an associative memory "table lookup" scheme to compute its new heading from the old heading plus angular acceleration. But in familiar environments, "local view" cells (possibly hippocampal place cells) adjust the compass to agree with the learned heading associated with that view direction. We mentioned previously that compass slewing might replace the second rotation when performing Collett et al.'s rotating cue array task.
McNaughton (personal communication) suggested that if local view cells can determine compass direction by direct matching of visual landmarks, the first subtraction and rotation steps might also be eliminated, leaving just one vector subtraction. We agree with the notion that distant landmarks should control the animal's compass in familiar environments. But it seems less plausible that viewing a configuration of nearby landmarks would provide sufficiently accurate heading information to solve the rotating cue array task by table lookup, because the view could change significantly with relatively small translations. Hence we believe at least one mental rotation step is required. Elsewhere in their paper, McNaughton et al. speculate that trajectory computations (vector subtractions) might be done by the same table lookup mechanism they propose for updating the inertial compass.
The drawback of this proposal is the large table that would be required to represent all possible pairs of vectors, and the cost of filling in the entries. The sinusoidal array appears to offer a simpler solution for vector arithmetic.

The neural architecture we have described is compatible with properties of parietal cortex. It manipulates phasors as sinusoidal arrays, but it does not explain how such representations arise in the first place. We simply assume that perceptual and memory vectors are available in the required form. We defend this assumption by noting that sinusoidal encodings of angles have already been found in rats and monkeys. Indications of a linear sensitivity to distance in rhesus sinusoidal cells, reported by Schwartz and Georgopoulos, offer additional support. At this point, the most important test of our model is whether rat parietal cells can be found with cosine response functions that are also linearly sensitive to distance.

Two other properties of our model are worth noting. As presently formulated, all cells in a sinusoidal array element have the same preferred direction (as do cells in a single orientation column in visual cortex), so there are only N directions represented. If the preferred directions of real parietal cells are found to cluster into a small number of discrete, evenly spaced values, this would be strong evidence in favor of the sinusoidal array hypothesis. However, we expect our model would also function correctly using input units with preferred directions smoothly distributed around the circle, so that neurons in bin i had a preferred direction somewhere within 2π(i ± 0.5)/N. We have not yet verified this experimentally, however. Due to the many-to-one connectivity of pyramidal cells, units in the output sinusoidal array should still show preferred direction values close to the centers of their respective bins. The model also assigns the same scale factor k to all neurons in an array.
But experimenters report a wide range of peak firing rates for direction-sensitive cells in both postsubiculum and parietal cortex. We again expect the model to function correctly under this condition, assuming only that the mean scale factor is the same across elements.
Acknowledgments

This work was supported by a grant from Fujitsu Corporation. Hank Wan and David Redish were supported by NSF Graduate Fellowships. We thank Bruce McNaughton and an anonymous referee for helpful comments on an earlier draft of this paper, and Longtang Chen for permission to reproduce one of his figures.
References

Anderson, C. H., and Van Essen, D. C. 1987. Shifter circuits: A computational strategy for dynamic aspects of visual processing. Proc. Natl. Acad. Sci. U.S.A. 84, 1148-1167.

Burgess, N., O'Keefe, J., and Recce, M. 1993. Using hippocampal 'place cells' for navigation, exploiting phase coding. In Advances in Neural Information Processing Systems 5, S. Hanson, J. Cowan, and L. Giles, eds., pp. 929-936. Morgan Kaufmann, San Mateo, CA.

Chen, L. L. 1991. Head-directional information processing in the rat posterior cortical areas. Doctoral dissertation, University of Colorado.

Chen, L. L., McNaughton, B. L., Barnes, C. A., and Ortiz, E. R. 1990. Head-directional and behavioral correlates of posterior cingulate and medial prestriate cortex neurons in freely-moving rats. Soc. Neurosci. Abstr. 16, 441.

Collett, T. S., Cartwright, B. A., and Smith, B. A. 1986. Landmark learning and visuo-spatial memories in gerbils. J. Comp. Physiol. A 158, 835-851.

Etienne, A. S. 1987. The control of short-distance homing in the golden hamster. In Cognitive Processes and Spatial Orientation in Animals and Man, P. Ellen and C. Thinus-Blanc, eds., pp. 233-251. Martinus Nijhoff, Dordrecht.

Georgopoulos, A. P., Schwartz, A. B., and Kettner, R. E. 1986. Neuronal population coding of movement direction. Science 233, 1416-1419.

Kalaska, J. F., Caminiti, R., and Georgopoulos, A. P. 1983. Cortical mechanisms related to the direction of two-dimensional arm movements: Relations in parietal area 5 and comparison with motor cortex. Exp. Brain Res. 51, 247-260.

Leonard, B., and McNaughton, B. L. 1990. Spatial representation in the rat: Conceptual, behavioral, and neurophysiological perspectives. In Neurobiology of Comparative Cognition, R. P. Kesner and D. S. Olton, eds., pp. 363-422. Erlbaum, Hillsdale, NJ.

Markus, E. J., McNaughton, B. L., Barnes, C. A., Green, J. C., and Meltzer, J. 1990. Head direction cells in the dorsal presubiculum integrate both visual and angular velocity information. Soc. Neurosci. Abstr. 16, 441.

McNaughton, B. L., and Nadel, L. 1990. Hebb-Marr networks and the neurobiological representation of action in space. In Neuroscience and Connectionist Theory, M. A. Gluck and D. E. Rumelhart, eds., pp. 1-63. Erlbaum, Hillsdale, NJ.

McNaughton, B. L., Chen, L. L., and Markus, E. J. 1991. "Dead reckoning," landmark learning, and the sense of direction: A neurophysiological and computational hypothesis. J. Cog. Neurosci. 3(2), 190-202.

Mittelstaedt, M.-L., and Mittelstaedt, H. 1980. Homing by path integration in a mammal. Naturwissenschaften 67, 566-567.

Nadel, L. 1991. The hippocampus and space revisited. Hippocampus 1(3), 221-229.

O'Keefe, J. 1991. An allocentric spatial model for the hippocampal cognitive map. Hippocampus 1(3), 230-235.

O'Keefe, J., and Recce, M. 1993. Phase relationship between hippocampal place units and the EEG theta rhythm. Hippocampus 3 (in press).
Olshausen, B., Anderson, C., and Van Essen, D. 1992. A neural model of visual attention and pattern recognition. CNS Memo 18, Computation and Neural Systems Program, California Institute of Technology.

Schwartz, A. B., and Georgopoulos, A. P. 1987. Relations between the amplitude of 2-dimensional arm movements and single cell discharge in primate motor cortex. Soc. Neurosci. Abstr. 13, 244.

Stein, J. 1991. Space and the parietal association areas. In Brain and Space, J. Paillard, ed., pp. 185-222. Oxford University Press, New York.

Taube, J. S., Muller, R. U., and Ranck, J. B., Jr. 1990a. Head direction cells recorded from the postsubiculum in freely moving rats. I. Description and quantitative analysis. J. Neurosci. 10, 420-435.

Taube, J. S., Muller, R. U., and Ranck, J. B., Jr. 1990b. Head direction cells recorded from the postsubiculum in freely moving rats. II. Effects of environmental manipulations. J. Neurosci. 10, 436-447.

Received 20 July 1992; accepted 4 March 1993.
Communicated by Lawrence Jackel
Fast Recognition of Noisy Digits

Jeffrey N. Kidder and Daniel Seligson
Intel Architecture Labs, Intel Corporation, Mailstop RN6-25, 2200 Mission College Blvd., Santa Clara, CA 95052 USA
We describe a hardware solution to a high-speed optical character recognition (OCR) problem. Noisy 15 × 10 binary images of machine-written digits were processed and applied as input to Intel's Electrically Trainable Analog Neural Network (ETANN). In software simulation, we trained an 80 × 54 × 10 feedforward network using a modified version of backprop. We then downloaded the synaptic weights of the trained network to ETANN and tweaked them to account for differences between the simulation and the chip itself. The best recognition error rate was 0.9% in hardware, with a 3.7% rejection rate, on a 1000-character test set.

Neural Computation 5, 885-892 (1993) © 1993 Massachusetts Institute of Technology

1 Introduction

We have solved a difficult optical character recognition (OCR) problem using a feedforward neural network configured as a "1 of 10" classifier. To meet the challenging throughput requirements of this application, we have deployed the solution on Intel's Electrically Trainable Analog Neural Network (ETANN) chip (Holler et al. 1989). In the OCR problem, we receive a 15 × 10 map from the output of a binary image acquisition system. The characters are the digits 0, 1, ..., 9, and the digit presentation rate is 12,500 characters per second, requiring a classification time of less than 80 μsec. Each digit is machine written with a laser on plastic, but the combined process of writing, processing of the plastic, and image acquisition is very noisy (see Fig. 1). Furthermore, the digits vary somewhat in size, from 8 × 5 to 10 × 9, and they drift around in the larger 15 × 10 field.¹

Figure 1: Examples of digits before any preprocessing was performed.

ETANN is an analog neural network with a flexible architecture. For our application, we configured it as an M input (M ≤ 128), N output, 64 − N hidden node feedforward network with binary inputs and outputs, but analog hidden units. In this mode, its feedforward execution time is about 6 μsec, thereby outperforming the 80 μsec classification requirement. As a "1 of 10" binary classifier (i.e., N = 10) with M = 80, its forward computation rate is 0.8 × 10⁹ connections per second, exceeding the capabilities of conventional microprocessors or digital signal processors.² Because the binary image as received has 150 pixels and ETANN has only 128 inputs, some compressive preprocessing was required. The scheme we developed required approximately 200 integer additions and 14 comparisons per pattern. This amounts to a few million operations per second given the pattern presentation rate requirements and is easily achievable with commercial microprocessors. In the system prototype deployed thus far, the actual throughput achieved is 22 characters per second, as compared to the 167,000 character per second theoretical throughput of ETANN. The difference is due to bandwidth limitations of the development system.

Initially we simulated ETANN in software and used a modified version of backprop (Rumelhart et al. 1986) to train it. The weights were then downloaded to an ETANN development system. A few iterations of chip-in-loop learning modified the weights to adjust for approximations in the simulation. We describe a preprocessing scheme, network architecture, and training algorithm. From a set of 5000 images, we selected 4000 at random for training, leaving the remaining 1000 images for test. The best performance we achieved was a test set error rate of 0.6% in software and 0.9% in hardware, at rejection rates of 3.6 and 3.7%, respectively.

¹The details of the application, e.g., why it was developed and where it is being deployed, are proprietary information. The problems we encountered and the solutions we found while developing it are, we believe, generic and should be disseminated.

²ETANN's peak computation rate, achieved with a 128 × 64 network having a 3 μsec execution time, is 2.7 × 10⁹ connections per second.
2 Preprocessing
The primary task of the preprocessing stage was to reduce the number of inputs from 150 to a maximum of 128. Since the digits were small and their centers moved around in the 15 × 10 image field, the recognition system needed some sort of built-in translation independence. We tested three preprocessing schemes: blocking, balancing, and linear compression. Blocking and balancing attempt to effect translation invariance in the preprocessing itself; with linear compression, the translation invariance must be trained into or built into the network. These schemes are summarized in Figure 2. The blocking algorithm computes the number of "on" pixels in each of the character-sized (10 × 8) windows in the larger (15 × 10) image field. The contents of the window containing the most "on" pixels were applied as input to the network; the "empty" rows and columns were ignored. However, noise confounds the algorithm, and some of the character itself was occasionally stripped away. Expanded blocking takes the subwindow from the blocking algorithm and includes the adjoining pixels on its periphery. In another approach, called balancing, we find the centroid of the image field and use it as the center of a new window. By itself, balancing does not reduce the total number of inputs to the network; we used two different methods to accomplish this. In the first, we stripped the peripheral rows and columns from the balanced character, producing a 13 × 8 character. In the second, we applied a linear compression. To visualize a linear compression, superimpose an m × n grid on the original 15 × 10 character; then, for each cell in the new grid, assign the percentage of "black space" filling it.
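The blocking step described above can be sketched in a few lines. This is our illustration of the algorithm, not the authors' code; the function name and the toy image are ours, while the window and field sizes follow the text.

```python
import numpy as np

def block(image, win_h=10, win_w=8):
    """Slide a character-sized window over the binary image field and
    return the contents of the window containing the most 'on' pixels."""
    H, W = image.shape  # the 15 x 10 field of the paper
    best, best_count = None, -1
    for r in range(H - win_h + 1):
        for c in range(W - win_w + 1):
            window = image[r:r + win_h, c:c + win_w]
            count = window.sum()  # number of 'on' pixels in this window
            if count > best_count:
                best, best_count = window, count
    return best

# A toy 15 x 10 field with a small blob of 'on' pixels
field = np.zeros((15, 10), dtype=int)
field[3:9, 2:6] = 1
sub = block(field)
print(sub.shape)  # (10, 8)
```

Because the blob fits entirely inside a 10 × 8 window here, the selected subwindow captures all of its "on" pixels; with real noisy digits, as the text notes, part of the character can be stripped away.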
Figure 2: Preprocessing of a 3.

3 Neural Network Architecture and Training
Using the SN2™ connectionist simulator (Bottou and Le Cun 1988), we wrote routines to model ETANN in the desired mode. ETANN deviates from the standard neural network model in two ways. First, its neuron output sigmoid is not the usual hyperbolic tangent (Holler et al. 1989). Second, the synaptic multiply (u_i w_ij) is only a true multiply for small values of u_i and w_ij, saturating outside this domain. Our model includes a cubic spline interpolation of the chip's sigmoid, but does not address the saturating multiply. The network is designed as a "1 of 10" classifier, meaning that it has 10 output units and that only one is intended to fire. During training and testing, the outputs are real-valued on the interval −1 to 1. In deployment, the outputs would be constrained to the binary limits, −1 and 1. In training and testing, the network is said to have classified
the input if the maximum output unit is within some tolerance, ε, of +1.0 and if all other output units are within ε of −1.0. In all other cases (i.e., more than one unit not within ε of −1.0, or no unit within ε of +1.0), the network is said to have rejected the input. When deployed with binary outputs, the tolerance is effectively ε = 1.0. The training procedure was a modification (Seligson et al. 1992) of the backprop algorithm employing both pattern and epoch learning. In one epoch, each training pattern is fed forward through the net and the output is compared with the desired output. If the character is incorrectly classified or the output is rejected, then the error vector is propagated back through the net and the weights are updated; momentum was used and set equal to 0.5. If the character is correctly classified, no change is made. This differs from "plain vanilla" backprop, in which an error vector is backpropagated for every pattern. At the end of the epoch, if the percentage of correctly classified patterns is sufficiently high (98% in most of our work) and the error rate sufficiently low (0.1–1.0%), then the network is said to have converged at the current tolerance. If it has not converged, then another epoch is initiated. If it has converged, we lower the tolerance and the learning rate, and then begin another epoch. This procedure is iterated until some other convergence criterion is met. These modifications have two principal advantages. First, by ignoring training vectors that pass the tolerance criterion, the network is encouraged to find weights such that all training vectors are equally bad (or good). Second, by linking the learning rate and the tolerance criterion, we can initialize the training procedure with large values of each (0.1 and 1.0 for this problem) and still achieve good convergence.
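The classify/reject rule above can be sketched directly; this is our schematic reconstruction, not the authors' code, and the function name `judge` is ours.

```python
import numpy as np

def judge(outputs, target, eps):
    """Return 'correct', 'error', or 'reject' per the paper's rule:
    the maximum output must lie within eps of +1.0 and every other
    output within eps of -1.0; all other cases are rejected."""
    k = int(np.argmax(outputs))
    others = np.delete(outputs, k)
    if abs(outputs[k] - 1.0) <= eps and np.all(np.abs(others + 1.0) <= eps):
        return "correct" if k == target else "error"
    return "reject"

# Example: with eps = 1.0 (the deployed binary limit) this output
# pattern is accepted as class 3.
out = np.array([-0.9, -0.8, -1.0, 0.7, -0.6, -0.9, -1.0, -0.7, -0.8, -0.9])
print(judge(out, 3, eps=1.0))  # correct
```

During an epoch, the error vector is backpropagated only when `judge` returns "error" or "reject"; once the epoch converges, both ε and the learning rate are lowered together, as described in the text.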
In practice we have seen that this procedure results in faster and tighter convergence than would be achieved by vanilla backprop using any one fixed value of the learning rate. Gaussian input noise was added to each pixel of the input pattern before a forward pass; we varied the standard deviation of the noise between 0.01 and 0.2. We found the final error rate to be more sensitive to the choice of preprocessing scheme than to small variations in initialization parameters, momentum, or the number of hidden nodes.

4 Results
The best results were obtained by blocking to 10 × 8. We achieved a 0.6% error rate on the 1000-digit test set with a rejection rate of 3.6%. Figure 1 illustrates that this performance is close to the limit of human perception. Table 1 summarizes the performance of six other networks, including a perceptron (i.e., a template matcher trained with backprop). Having selected the 10 × 8 input format, we were ready to map the 80-input, 54-hidden-node, 10-output-unit network onto the chip. Figure 3 shows, schematically, the 128 × 64 array of weights in the ETANN chip.
Figure 3: Mapping the feedforward network into ETANN.

Table 1: Recognition Results.

Input                            Hidden   Error %   Reject %   Comments
10 × 8 blocked                     54       0.9        3.7     On ETANN
10 × 8 blocked                     54       0.6        3.6
10 × 8 blocked                     10       3.6        7.6
10 × 8 blocked                      0       4.0       12.4     Perceptron
12 × 10 blocked with periphery     54       1.1        2.6
13 × 8 balanced and trimmed        54       1.9        8.9
9 × 7 balanced and compressed      32       2.7        9.9
The first layer of weights is loaded as an 80 × 54 block. The outputs (summing down the columns) of this layer are fed into a 54 × 10 block of weights.³ The output of the second layer is the output of the chip. After computing the weights with SN2™, we transferred them to iBrainMaker™ 2.0 and Intel's Neural Network Training System™ for implementation and testing in hardware. Downloading weights to ETANN is a process of applying high-voltage pulses to program analog floating gate memory devices. We then trained the chip itself for a few epochs, to account for chip-to-chip variations, imperfections in the downloading process, and limitations in our software model. The three most significant limitations were (1) that the device synapses have a dynamic range and precision of about 1 in 100, whereas the simulator used floating point arithmetic with effectively unlimited dynamic range and precision; (2) that we used a perfect multiply rather than the chip's saturating multiply; and (3) that we did not account for fabrication-related synapse-to-synapse or neuron-to-neuron variations present in ETANN (Holler et al. 1989). On ETANN itself, we achieved an error rate of 0.9% on the test set with a tolerance of 1.0 and a 3.7% rejection rate, substantially the same as in software. Thus we can conclude that ETANN's finite dynamic range and precision did not restrict its ability to perform this classification task, and that the learning algorithm as applied to the chip is able to overcome the simplifications of the model.
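The effect of the chip's roughly 1-in-100 synaptic precision can be mimicked in simulation by snapping trained weights onto a coarse grid; this is the kind of mismatch the few chip-in-loop epochs must absorb. The sketch below is our illustration, not Intel's tooling.

```python
import numpy as np

def quantize(w, levels=100):
    """Round weights to about 1 part in `levels` of the weight range,
    mimicking ETANN's finite dynamic range and precision."""
    span = np.max(np.abs(w))
    if span == 0.0:
        return w.copy()
    step = 2.0 * span / levels          # uniform grid over [-span, +span]
    return np.round(w / step) * step

rng = np.random.default_rng(0)
w = rng.normal(size=(80, 54))           # first-layer weight block shape
wq = quantize(w)
print(np.max(np.abs(w - wq)) <= np.max(np.abs(w)) / 100)  # True
```

The worst-case rounding error is half a grid step, i.e., 1% of the weight range, which matches the text's observation that this precision did not limit classification performance here.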
5 Discussion
We mention briefly some obstacles that hindered us from achieving lower error rates, and some options that were not explored. First, the character font was ill-suited to the task at hand, primarily because the character set was not sufficiently orthogonalized, especially among the 0, 2, 3, 5, 6, 8, and 9. Noise produces ambiguities and misclassifications, both for our network and for human subjects viewing the same data (see Fig. 1). Second, there was not enough training data available. A model of the image generation and acquisition system (in particular the noise) would have allowed us to synthesize a large data set for training, testing, and evaluation. Finally, there were altogether different architectures that we considered but did not investigate. In our chosen method, we preprocessed the 15 × 10 data by scanning a smaller window over the input pattern and then presented the best window as input to the net. Since we had 80 μsec for the task, and the net executes in 6 μsec, another alternative would have been to scan the net over the input pattern, postprocessing the outputs to determine the most probable input digit.

³Thresholds are applied, not shown.
Acknowledgments

We would like to thank Steven Anderson, Finn Martin, and Simon Tam for assistance with ETANN, and Maria Douglas for assistance with the manuscript. SN2 is a trademark of Neuristique, iBrainMaker is a trademark of California Scientific Software, and the Neural Network Training System is a trademark of Intel Corporation.
References

Bottou, L.-Y., and Le Cun, Y. 1988. SN: A simulator for connectionist models. In Neuro-Nîmes 88, Nîmes, France.
Holler, M., Tam, S., Castro, H., and Benson, R. 1989. An electrically trainable artificial neural network (ETANN) with 10240 floating gate synapses. Proc. IJCNN-89-WASH-DC, pp. 191–196, Summer 1989.
Rumelhart, D. E., and McClelland, J. L. (eds.). 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. MIT Press, Cambridge, MA.
Seligson, D., Hansel, D., Griniasty, M., and Shoresh, N. 1992. Computing with a difference neuron. Network 3, 187–204.

Received 18 February 1992; accepted 18 March 1993.
Communicated by Eric Baum
Local Algorithms for Pattern Recognition and Dependencies Estimation

V. Vapnik and L. Bottou
AT&T Bell Laboratories, Holmdel, NJ 07733 USA
In previous publications (Bottou and Vapnik 1992; Vapnik 1992) we described local learning algorithms, which result in performance improvements for real problems. We present here the theoretical framework on which these algorithms are based. First, we present a new statement of certain learning problems, namely the local risk minimization. We review the basic results of the uniform convergence theory of learning, and extend these results to local risk minimization. We also extend the structural risk minimization principle for both pattern recognition problems and regression problems. This extended induction principle is the basis for a new class of algorithms.

1 Introduction

The concept of learning is wide enough to encompass several mathematical statements. The notions of risk minimization and of loss function (cf. Vapnik 1982), for instance, have unified several problems, such as pattern recognition, regression, and density estimation. The classical analysis of learning deals with the modeling of a hypothetical truth, given a set of independent examples. In the classical statement of the pattern recognition problem, for instance, we select a classification function given a training set. This selection aims at keeping small the number of misclassifications observed when the selected function is applied to new patterns drawn from the same distribution as the training patterns. In this statement, the underlying distribution of examples plays the role of a hypothetical truth, and the selected function models a large part of this truth, i.e., the dependence of the class on the input data. We introduce in this paper a different statement of the learning problem. In this local statement, the selection of the classification function aims at reducing the probability of misclassification for a given test pattern. This process, of course, can be repeated for any particular test pattern.
This is quite a different task: instead of estimating a function, we estimate the value of a function at a given point (or in the vicinity of a given point).

Neural Computation 5, 893–909 (1993) © 1993 Massachusetts Institute of Technology
The difference between these two statements can be illustrated by a practical example. A multilayer network illustrates the classical approach: the training procedure builds a model using a training set, and this model is then used for all testing patterns. On the other hand, the nearest-neighbor method is the simplest local algorithm: given a testing point, we estimate its class by searching for the closest pattern in the training set. This process must be repeated for each particular test pattern. The statement of the problem defines the goal of the learning procedure. This goal is evaluated a posteriori by the performance of the system on some test data. The training data, however, do not even provide enough information to define this goal unambiguously. We must then rely on an induction principle, i.e., a heuristic method for "guessing" a general truth on the basis of a limited number of examples. Any learning algorithm assumes, explicitly or implicitly, some induction principle, which determines the elementary properties of the algorithm. The simplest induction principle, namely the principle of empirical risk minimization (ERM), is also the most commonly used. According to this principle, we should choose the function that minimizes the number of errors on the training set. The theory of empirical risk minimization was developed in the 1970s (Vapnik 1982). In the case of pattern recognition, this theory provides a bound on the probability of error, p, when the classification function is chosen from a set of functions of finite VC-dimension. In the simplest case, with probability 1 − η the following inequality is true (Vapnik 1982, p. 156, Theorem 6.7):
$$p \le \nu + D \qquad (1.1)$$

where ν is the frequency of error on the training set, and D is a confidence interval, which depends on the number of training examples l, on the VC-dimension h of the set of functions, and on the confidence η. When the VC-dimension of the set of functions increases, the frequency of error on the training set decreases, but the width D of the confidence interval increases. This behavior leads to a new induction principle, namely the principle of structural risk minimization (SRM) (Vapnik 1982). Consider a collection of subsets imbedded in the set of functions,
$$S_1 \subset S_2 \subset \cdots \subset S_n$$

where $S_k$ is a subset of functions with VC-dimension $h_k$, and $h_k < h_{k+1}$. For each subset $S_k$, a function $f_k$ minimizes the frequency of error on the training set, and thus fulfills inequality 1.1. Successive functions
$f_k$ yield a decreasing number of errors ν on the training set, but have increasingly wide confidence intervals D. The principle of structural risk minimization consists in selecting the subset $S_{k^*}$ and the function $f_{k^*}$ that minimize the right-hand side of inequality 1.1, called the guaranteed risk. The SRM principle requires the choice of a nested structure on the set of functions. An adequate structure can significantly improve the generalization performance; a poor structure may have a limited negative impact. In the local statement of the learning problem, we aim at selecting a valid function in the vicinity of a given test point x₀. On the basis of the training set, we will select a "width" for this vicinity, as well as a function for classifying vectors in this vicinity. To solve this problem, an extended SRM principle will be considered. We will minimize the guaranteed risk not only by selecting a subset $S_{k^*}$ and a function $f_{k^*} \in S_{k^*}$, but also by selecting the width β of the vicinity of the point x₀. Using β as an additional parameter allows us to find a deeper minimum of the guaranteed risk, as demonstrated on a practical application in Bottou and Vapnik (1992). The paper is organized as follows. First, we state and discuss the problem of risk minimization and the problem of local risk minimization. In Section 3, we derive a few useful bounds for uniform convergence of averages to their expectations. In Sections 4 and 5 we derive bounds on the local risk for the problems of pattern recognition and regression. In Section 6, we extend the structural risk minimization principle to local algorithms.
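The SRM selection rule can be illustrated numerically: over a nested sequence of classes, pick the one minimizing training error plus confidence interval. The particular form of D used below is a standard VC-type confidence term chosen by us for illustration; the exact constants in inequality 1.1 differ, so the numbers are purely schematic.

```python
import math

def confidence_interval(l, h, eta):
    """Schematic VC confidence term D(l, h, eta): it grows with the
    VC-dimension h and shrinks with the sample size l."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

def srm_select(train_errors, vc_dims, l, eta=0.05):
    """Return the index of the class minimizing the guaranteed risk
    nu_k + D(l, h_k, eta) over the nested structure."""
    risks = [nu + confidence_interval(l, h, eta)
             for nu, h in zip(train_errors, vc_dims)]
    return min(range(len(risks)), key=risks.__getitem__)

# Training error decreases with capacity, but D grows: the minimum of
# the guaranteed risk is attained at an intermediate class.
nus = [0.30, 0.12, 0.05, 0.04, 0.039]
hs = [2, 8, 32, 128, 512]
print(srm_select(nus, hs, l=1000))  # 1
```

The selected index sits between the smallest class (large training error) and the largest (large confidence interval), which is exactly the trade-off the guaranteed risk formalizes.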
2 Global and Local Risk Minimization
Like many illustrious scientists, we will assume, in this paper, that a metaphysical truth rules both our training examples and our testing cases. Like many illustrious statisticians, we will also assume that this truth can be represented, for our purposes, by an unknown probability distribution $F(x,y)$, defined on a space of input–output pairs $(x,y) \in R^n \times R^1$. In the classical statement of global risk minimization, a parameter $\alpha \in \Lambda$ defines a model $x \to f(x,\alpha)$ of the output $y$. A loss function $Q[y, f(x,\alpha)]$, measurable with respect to $F(x,y)$, quantifies the quality of the estimate $f(x,\alpha)$ for the output $y$. We wish then to minimize the global risk functional

$$R(\alpha) = \int Q[y, f(x,\alpha)]\, dF(x,y) \qquad (2.1)$$

over all functions $\{f(x,\alpha),\ \alpha \in \Lambda\}$, when the distribution $F(x,y)$ is unknown, but when a random independent sample of size $l$,

$$x_1, y_1;\ \ldots;\ x_l, y_l \qquad (2.2)$$
is given. Let us introduce the statement of local risk minimization in the vicinity of a given point x₀. In this statement, we aim at modeling the truth in a small neighborhood around x₀. A nonnegative function K(x, x₀, β) embodies the notion of vicinity. This function depends on the point x₀ and on a "locality" parameter β ∈ (0,∞), and satisfies:

i. 0 ≤ K(x, x₀, β) ≤ 1,
ii. K(x₀, x₀, β) = 1.

For example, both the "hard threshold" locality function

$$K(x, x_0, \beta) = \begin{cases} 1 & \text{if } \|x - x_0\| \le \beta/2 \\ 0 & \text{otherwise} \end{cases} \qquad (2.3)$$

and the "normal" locality function

$$K(x, x_0, \beta) = \exp\left(-\frac{\|x - x_0\|^2}{\beta^2}\right) \qquad (2.4)$$

meet these conditions.
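Both locality functions are one line each. In this sketch the Gaussian parametrization of the "normal" function is our own assumption; the hard threshold follows the β/2 window used later in the text.

```python
import numpy as np

def hard_threshold(x, x0, beta):
    """K(x, x0, beta) = 1 if ||x - x0|| <= beta/2, else 0 (eq. 2.3)."""
    return float(np.linalg.norm(x - x0) <= beta / 2)

def normal(x, x0, beta):
    """Gaussian vicinity of width beta (our reading of eq. 2.4):
    equals 1 at x = x0 and lies in [0, 1] everywhere."""
    return float(np.exp(-np.linalg.norm(x - x0) ** 2 / beta ** 2))

x0 = np.zeros(2)
print(hard_threshold(np.array([0.3, 0.0]), x0, beta=1.0))  # 1.0
print(normal(x0, x0, beta=1.0))                            # 1.0
```

Both satisfy conditions i and ii above; they differ only in how sharply membership in the vicinity falls off with distance.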
Let us define the norm of the locality function as

$$\|K(x_0, \beta)\| = \int K(x, x_0, \beta)\, dF(x, y)$$
Let us consider again a parametric function f(x,α) and a measurable loss function Q[y, f(x,α)]. We want to minimize the local risk functional

$$R(\alpha, \beta) = \int Q[y, f(x,\alpha)]\, \frac{K(x, x_0, \beta)}{\|K(x_0, \beta)\|}\, dF(x, y) \qquad (2.5)$$
over the parameters α and β, when the distribution function F(x,y) is unknown, but when a random independent sample x₁, y₁, ..., x_l, y_l is given. In most cases, knowledge of the distribution F(x,y) would make this problem trivial. For example, if the locality function is either the hard threshold locality function (2.3) or the normal locality function (2.4), we would select β = 0 and adjust α to get the "right" value for f(x₀, α). The true distribution F(x,y), however, is unknown. Selecting a nontrivial value for the locality parameter might reduce the generalization error induced by the unavoidable inaccuracy of the parameter α. A new induction principle has been developed to take advantage of this fact; it is described in Section 6. Let us apply the statement of local risk minimization to the problems of pattern recognition and regression.
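With the hard-threshold vicinity, minimizing the empirical local risk amounts to fitting a model on the training points within β/2 of the test point, in the spirit of the local algorithms of Bottou and Vapnik (1992). A minimal sketch with the simplest possible local model, a majority vote (all names are ours):

```python
import numpy as np

def local_predict(x0, X, y, beta):
    """Classify x0 by empirical local risk minimization with the
    hard-threshold vicinity: keep the training points within beta/2
    of x0 and return the majority label among them (falling back to
    the nearest neighbor if the vicinity is empty)."""
    d = np.linalg.norm(X - x0, axis=1)
    inside = d <= beta / 2
    if not inside.any():
        return int(y[int(np.argmin(d))])   # nearest-neighbor fallback
    labels = y[inside]
    return int(np.bincount(labels).argmax())

X = np.array([[0.0], [0.1], [0.2], [2.0], [2.1]])
y = np.array([0, 0, 0, 1, 1])
print(local_predict(np.array([0.05]), X, y, beta=0.5))  # 0
print(local_predict(np.array([2.05]), X, y, beta=0.5))  # 1
```

As β → 0 this degenerates to the nearest-neighbor rule mentioned in the introduction; as β → ∞ it returns the globally most frequent class, so the choice of β is exactly the locality trade-off the paper studies.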
In the case of pattern recognition, the outputs y take only two values, 0 or 1, and {f(x,α), α ∈ Λ} is a set of indicator functions. The simplest loss function

$$Q[y, f(x,\alpha)] = \begin{cases} 0 & \text{if } y = f(x,\alpha) \\ 1 & \text{otherwise} \end{cases}$$

merely indicates the presence or absence of a classification error. The risk functional (2.1) then measures the probability of classification error for the function f(x,α). The global pattern recognition problem consists in selecting, on the basis of the training set, a function f(x,α*) that guarantees a small probability of classification error. Now, let us consider the local risk functional (2.5), using the hard threshold locality function (2.3). This functional measures the conditional probability of classification error knowing that ‖x − x₀‖ ≤ β/2. The local pattern recognition problem consists in selecting, on the basis of the training set, a value for the locality parameter β* and a function f(x,α*) that guarantee a small probability of classification error in ‖x − x₀‖ ≤ β*/2. In the case of the regression problem, the outputs y are real values, and {f(x,α), α ∈ Λ} is a set of real functions. We will consider a quadratic loss function,
$$Q[y, f(x,\alpha)] = [y - f(x,\alpha)]^2 \qquad (2.6)$$
The minimum of the global risk functional (2.1) is achieved by the function of the class {f(x,α), α ∈ Λ} closest to the regression function

$$y(x) = E(y \mid x) = \int y\, dF(y \mid x) \qquad (2.7)$$

in the quadratic metric

$$\rho^2(f(x,\alpha), y(x)) = \int [f(x,\alpha) - y(x)]^2\, dF(x)$$

The minimum of the local risk functional (2.5), using the locality function (2.3), is achieved by the function of the class {f(x,α), α ∈ Λ} closest to the regression function y(x) in the metric

$$\rho^2_{x_0}(f(x,\alpha), y(x)) = \int [f(x,\alpha) - y(x)]^2\, \frac{K(x, x_0, \beta)}{\|K(x_0, \beta)\|}\, dF(x) \qquad (2.8)$$

The local regression problem consists in selecting, on the basis of the training set, a value for the locality parameter β* and a function f(x,α*) that guarantee a small conditional quadratic error.
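For regression with the quadratic loss (2.6), minimizing the empirical local risk over the simplest class, constant models f(x) = c, yields a locality-weighted mean of the outputs. The Gaussian vicinity used below is our assumption, and the constant model class is our illustrative choice:

```python
import numpy as np

def local_constant_fit(x0, X, y, beta):
    """Minimize the empirical local quadratic risk over constant
    models f(x) = c: the optimum is the K-weighted mean of y."""
    w = np.exp(-np.linalg.norm(X - x0, axis=1) ** 2 / beta ** 2)
    return float(np.sum(w * y) / np.sum(w))

X = np.linspace(0, 1, 11).reshape(-1, 1)
y = X.ravel() ** 2
# A small beta tracks the curve near x0 = 0.5; a huge beta recovers
# the global mean of y, illustrating the role of the locality width.
print(round(local_constant_fit(np.array([0.5]), X, y, beta=0.1), 3))  # 0.255
print(round(local_constant_fit(np.array([0.5]), X, y, beta=1e6), 3))  # 0.35
```

The small-β estimate stays close to y(0.5) = 0.25, while the large-β estimate is the global average, which is exactly the bias introduced by an overly wide vicinity.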
3 Theory of Uniform Convergence
For simplicity, we will refer to the pairs (x,y) as a vector z, and we will denote the loss function Q[y, f(x,α)] as Q(z,α). The notation F(z) denotes a probability distribution on the pairs (x,y). First, we will review uniform convergence results for the global risk functional

$$R(\alpha) = \int Q(z,\alpha)\, dF(z) \qquad (3.1)$$

These results are then extended to the local risk functional, using a transformation of the probability distribution F(z): the global risk can be made local by "oversampling" the probability distribution around the point x₀. We have already stressed the fact that optimizing (3.1) is generally impossible, unless we know F(z) exactly. If our knowledge of F(z) is limited to a random independent sample

$$z_1, \ldots, z_l \qquad (3.2)$$
we must rely on an induction principle, like empirical risk minimization (ERM) or structural risk minimization (SRM). A good induction principle should provide a way to select a value α_l that guarantees a small value for the risk R(α_l). More precisely, two questions should be answered:

1. When is the method of empirical risk minimization consistent? In other words, does the generalization risk R(α_l) converge to the minimum of the risk functional R(α) when the size l of the training set increases?

2. How fast is this convergence? In general, the number of training examples is limited, and the answer to this question is of crucial practical importance.
Many induction principles, including SRM and ERM, rely on the empirical risk functional

$$E_l(\alpha) = \frac{1}{l} \sum_{i=1}^{l} Q(z_i, \alpha) \qquad (3.3)$$

which estimates the risk R(α) using the training set (3.2). In these cases, the answers to the two questions stated above depend on the quality of the estimate (3.3). More precisely:

1. Does the empirical risk functional E_l(α) converge to the risk functional R(α) when the size of the training set increases, uniformly
over the set of functions {Q(z,α), α ∈ Λ}? The uniform convergence takes place if, for any ε > 0,

$$\lim_{l \to \infty} P\left\{ \sup_{\alpha \in \Lambda} |R(\alpha) - E_l(\alpha)| > \varepsilon \right\} = 0$$
2. What is the rate of this convergence?

The theory of uniform convergence of empirical risk to actual risk, developed in the 1970s and 1980s (cf. Vapnik 1982), contains a necessary and sufficient condition for uniform convergence, and provides bounds on the rate of uniform convergence. These bounds do not depend on the distribution function F(z); they are based on a measure of the capacity (the VC-dimension) of the set of functions {Q(z,α), α ∈ Λ}.
Definition 1. The VC-dimension of the set of indicator functions {Q(z,α), α ∈ Λ} is the maximum number h of vectors z₁, ..., z_h that functions of the set {Q(z,α), α ∈ Λ} can separate into two classes in all 2^h possible ways.

Definition 2. The VC-dimension of the set of real functions {Q(z,α), α ∈ Λ} is defined as the VC-dimension of the following set of indicator functions:

$$Q_c(z,\alpha) = \theta[Q(z,\alpha) - c], \quad \alpha \in \Lambda,\ c \in \left( \inf_{z,\alpha} Q(z,\alpha),\ \sup_{z,\alpha} Q(z,\alpha) \right)$$

where

$$\theta(u) = \begin{cases} 1 & \text{if } u > 0 \\ 0 & \text{otherwise} \end{cases}$$
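Definition 1 can be checked mechanically in small cases by brute force. The toy class below, 1-D threshold indicators θ(x − b), is our illustrative example: it shatters any single point but no pair of points, so its VC-dimension is 1.

```python
def shatters(points, functions):
    """True if the function set realizes all 2^h labelings of `points`."""
    realized = {tuple(f(x) for x in points) for f in functions}
    return len(realized) == 2 ** len(points)

# 1-D threshold indicators theta(x - b) for a grid of thresholds b
thresholds = [i / 10 for i in range(-20, 21)]
fns = [lambda x, b=b: int(x > b) for b in thresholds]

print(shatters([0.5], fns))        # True  -> VC-dimension >= 1
print(shatters([0.3, 0.7], fns))   # False -> labeling (1, 0) is impossible
```

For two points x₁ < x₂, a threshold can never label x₁ positive while labeling x₂ negative, so only three of the four labelings are realized and shattering fails.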
Two theorems are valid for sets of indicator loss functions. We assume here that the loss functions Q(z,α), α ∈ Λ, are indicator functions of sets defined in z-space.

Theorem 1. Let the set of indicator functions {Q(z,α), α ∈ Λ} have VC-dimension h. Then the following inequality is true:

$$P\left\{ \sup_{\alpha\in\Lambda} |R(\alpha) - E_l(\alpha)| > \varepsilon \right\} < \left(\frac{2le}{h}\right)^h \exp(-\varepsilon^2 l) \qquad (3.4)$$
This theorem is proven in Vapnik (1982, p. 170, Theorem A.2). The quantity (2l)^h/h! has been bounded by the more convenient quantity (2le/h)^h, using Stirling's formula. Bound (3.4), however, is limited by the behavior of the absolute difference between the risk and the empirical risk when the risk R(α) is close to 1/2. Theorem 2 provides a bound on the "relative" difference between the risk and the empirical risk.
Theorem 2. Let the set of indicator functions {Q(z,α), α ∈ Λ} have VC-dimension h. Then the following inequality is true:

$$P\left\{ \sup_{\alpha\in\Lambda} \frac{R(\alpha) - E_l(\alpha)}{\sqrt{R(\alpha)}} > \varepsilon \right\} < \left(\frac{2le}{h}\right)^h \exp\left(-\frac{\varepsilon^2 l}{4}\right) \qquad (3.5)$$

This theorem is proven in Vapnik (1982, p. 176, Theorem A.3). Again, the quantity (2l)^h/h! has been bounded by the more convenient quantity (2le/h)^h, using Stirling's formula. Both Theorems 1 and 2 can be generalized to the case of uniformly bounded loss functions. We assume now that the loss functions Q(z,α) are nonnegative and satisfy the condition

$$0 \le Q(z,\alpha) \le B, \quad \alpha \in \Lambda \qquad (3.6)$$
Theorem 3. Let the uniformly bounded set of real functions (3.6) have VC-dimension h. Then the following bound is true:

$$P\left\{ \sup_{\alpha\in\Lambda} |R(\alpha) - E_l(\alpha)| > \varepsilon \right\} < \left(\frac{2le}{h}\right)^h \exp\left(-\frac{\varepsilon^2 l}{B^2}\right) \qquad (3.7)$$

This theorem is proved in Appendix 1.
Theorem 4. Let a uniformly bounded set of functions (3.6) have VC-dimension h. Then the following bound is valid:

$$P\left\{ \sup_{\alpha\in\Lambda} \frac{R(\alpha) - E_l(\alpha)}{\sqrt{R(\alpha)}} > \varepsilon \right\} < \left(\frac{2le}{h}\right)^h \exp\left(-\frac{\varepsilon^2 l}{4B}\right) \qquad (3.8)$$
This theorem is proved in Appendix 2. Finally, we need a bound on the rate of uniform convergence for a set of unbounded real functions {Q(z,α), α ∈ Λ}. Such a bound requires some restriction on the large deviations of the set of loss functions. This is also true for the classical bounds. Although the law of large numbers says that the average of random values converges to their mathematical expectation, the rate of convergence can be slow. The next example shows that even when the set of functions contains only one function, it is impossible to bound the rate of convergence without additional information. Consider a random variable ξ that takes two values: 0 with probability 1 − ε, and 1/ε² with probability ε. The expectation of this random variable is

$$E(\xi) = (1-\varepsilon)\cdot 0 + \varepsilon \cdot \frac{1}{\varepsilon^2} = \frac{1}{\varepsilon}$$
The empirical average is null if all l observations are 0. The probability of this event is

$$P(0) = (1-\varepsilon)^l$$

For a small ε, the expectation E(ξ) is large, but the empirical average is null with high probability. In Theorems 1 to 4, we have assumed a uniform bound (1 or B) on the losses Q(z,α). This bound forbids large deviations. We consider now the case of nonnegative, unbounded losses that satisfy the following mild restriction:

$$\sup_{\alpha\in\Lambda} \frac{\sqrt{\int Q^2(z,\alpha)\,dF(z)}}{\int Q(z,\alpha)\,dF(z)} \le \tau \qquad (3.9)$$

This condition reflects a restriction on the "tails" of the distribution of the losses Q(z,α). Generally, it means that the probability that the random value sup_{α∈Λ} Q(z,α) exceeds some value A "decreases fast" when A increases; the value τ determines how fast it decreases. For instance, let Q(z,α) be a quadratic loss [y − f(x,α)]². If the random variable
$$\xi_\alpha = y - f(x,\alpha)$$

is distributed according to the normal law, the ratio of moments in condition (3.9) is equal to √3 (independent of the values of the parameters). If the random variable is distributed according to the Laplace law, this ratio is equal to √6 (also independent of the values of the parameters). The following result has been proved in Vapnik (1982, p. 202).
Theorem 5. Let {Q(z,α), α ∈ Λ} be a set of nonnegative real functions with VC-dimension h. Then the following bound is true:

$$P\left\{ \sup_{\alpha\in\Lambda} \frac{R(\alpha) - E_l(\alpha)}{\sqrt{\int Q^2(z,\alpha)\,dF(z)}} > \varepsilon\, a(\varepsilon) \right\} < \left(\frac{2le}{h}\right)^h \exp\left(-\frac{\varepsilon^2 l}{4}\right) \qquad (3.10)$$

where

$$a(\varepsilon) = \sqrt{1 - \ln \varepsilon}$$

In this formulation again, (2l)^h/h! has been bounded by (2le/h)^h. We obtain a uniform bound for the relative difference between the risk and the empirical risk by applying condition (3.9) to this result.
Let us extend this inequality to the case of local algorithms. First, for any fixed value of α and β, note that the local risk functional

$$R(\alpha,\beta) = \int Q(z,\alpha)\, \frac{K(x,x_0,\beta)}{\|K(x_0,\beta)\|}\, dF(z) \qquad (3.11)$$

is equal to the expectation of the loss function Q(z,α) with respect to a new distribution function F(z,β) defined by

$$dF(z,\beta) = \frac{K(x,x_0,\beta)}{\|K(x_0,\beta)\|}\, dF(z)$$

We will consider the set of functions Q(z,α), α ∈ Λ, and the set of probability distribution functions F(z,β), β ∈ (0,∞), that satisfy the following inequality:
$$\sup_{\alpha\in\Lambda} \frac{\sqrt{\int Q^2(z,\alpha)\,dF(z,\beta)}}{\int Q(z,\alpha)\,dF(z,\beta)} \le \tau \qquad (3.12)$$

Let us define the unnormalized local risk, R(α,β,x₀), and the unnormalized empirical local risk, E_l(α,β,x₀), as follows:

$$R(\alpha,\beta,x_0) = \int Q(z,\alpha)\, K(x,x_0,\beta)\, dF(z)$$

$$E_l(\alpha,\beta,x_0) = \frac{1}{l} \sum_{i=1}^{l} Q(z_i,\alpha)\, K(x_i,x_0,\beta)$$
We will show that under condition (3.12) the following inequality is true:

$$P\left\{ \sup_{\alpha\in\Lambda,\ \beta\in[0,\infty)} \frac{\left[R(\alpha,\beta,x_0) - E_l(\alpha,\beta,x_0)\right]\sqrt{\|K(x_0,\beta)\|}}{R(\alpha,\beta,x_0)} > \tau\,\varepsilon\, a(\varepsilon) \right\} < \left(\frac{2le}{h'}\right)^{h'} \exp\left(-\frac{\varepsilon^2 l}{4}\right) \qquad (3.13)$$

where h′ is the VC-dimension of the set of functions

$$\{ Q(z,\alpha)\, K(x,x_0,\beta),\ \alpha \in \Lambda,\ \beta \in [0,\infty) \}$$

To prove this inequality, we note that Theorem 5 implies the following inequality:

$$P\left\{ \sup_{\alpha\in\Lambda,\ \beta\in[0,\infty)} \frac{R(\alpha,\beta,x_0) - E_l(\alpha,\beta,x_0)}{\sqrt{\int Q^2(z,\alpha)\, K^2(x,x_0,\beta)\, dF(z)}} > \varepsilon\, a(\varepsilon) \right\} < \left(\frac{2le}{h'}\right)^{h'} \exp\left(-\frac{\varepsilon^2 l}{4}\right) \qquad (3.14)$$
Moreover, since 0 ≤ K(x,x₀,β) ≤ 1, we have

$$\sqrt{\int Q^2(z,\alpha)\, K^2(x,x_0,\beta)\, dF(z)} \le \sqrt{\|K(x_0,\beta)\| \int Q^2(z,\alpha)\, dF(z,\beta)} \qquad (3.15)$$

and according to (3.12), the following inequality is true for any β ∈ [0,∞):

$$\sqrt{\int Q^2(z,\alpha)\, dF(z,\beta)} \le \tau \int Q(z,\alpha)\, dF(z,\beta) = \frac{\tau\, R(\alpha,\beta,x_0)}{\|K(x_0,\beta)\|} \qquad (3.16)$$
Inequality (3.13) is derived from inequalities (3.14), (3.15), and (3.16).

4 Bounds for the Local Risk in Pattern Recognition
In this section, we apply the previous results to the problem of pattern recognition. Consider the set of integrands of the unnormalized local risk functional R(α,β,x₀):

$$\{ Q(z,\alpha)\, K(x,x_0,\beta),\ \alpha \in \Lambda,\ \beta \in (0,\infty) \} \qquad (4.1)$$

where Q(z,α) is an indicator function and K(x,x₀,β) is a nonnegative real function. Let h₁ be the VC-dimension of the set of indicator loss functions {Q(z,α), α ∈ Λ}. Let h₂ be the VC-dimension of the set of nonnegative real functions {K(x,x₀,β), β ∈ (0,∞)}. Since Q(z,α) takes only the values 0 or 1, the following equality is true for any nonnegative real function r(z,β):

$$\theta\{Q(z,\alpha)\, r(z,\beta) - c\} = Q(z,\alpha)\, \theta\{r(z,\beta) - c\}, \quad \alpha \in \Lambda,\ \beta \in (0,\infty),\ c \in (0,\infty)$$

Moreover, it is known that the VC-dimension of the product of two sets of indicator functions does not exceed the sum of the VC-dimensions of the two sets. Therefore, the definition of the VC-dimension of a set of real functions implies that the VC-dimension of the set of functions (4.1) does not exceed h₁ + h₂. Let us apply Theorem 4 to this set of functions.
Let η/2 denote the right-hand side of this inequality. By solving the equation

and replacing the result into our inequality, we obtain an equivalent formulation: With probability 1 − η/2, the following inequality is true for all functions in {Q(z, α)K(x, x₀, β), α ∈ Λ, β ∈ (0, ∞)}.
R(α, β, x₀) ≤ R_emp(α, β, x₀) + E   (4.2)

where
By dividing both sides of inequality (4.2) by ‖K(x₀, β)‖, we obtain
The value of ‖K(x₀, β)‖ in the right-hand side of inequality (4.4) depends on the distribution function F(z). A lower bound for the value of ‖K(x₀, β)‖ is obtained by using the empirical functional:
where z_i = (x_i, y_i) are the elements of the training set (3.2). Applying Theorem 3 to the set of uniformly bounded functions {K(x, x₀, β), β ∈ (0, ∞)} results in
In other words, the following inequality is simultaneously true for all β ∈ [0, ∞), with probability 1 − η/2:
where (u)₊ = max{0, u}. Let us define 𝒦(x₀, β) as the right-hand side of inequality (4.5).
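The empirical lower bound 𝒦(x₀, β) can be sketched in code. This is only an illustration: the exact confidence correction of inequality (4.5) is not reproduced here, so the `correction` argument is a stand-in for it, and the Gaussian locality function is our own choice of example.

```python
import math

def empirical_locality_norm(K_vals, correction):
    """Lower bound K(x0, beta) on ||K(x0, beta)||: the empirical mean of
    the locality function over the training inputs, minus a confidence
    correction, truncated with (u)+ = max{0, u}.  The precise correction
    term of inequality (4.5) is not reproduced here."""
    mean = sum(K_vals) / len(K_vals)
    return max(0.0, mean - correction)

# Example locality function: K(x, x0, beta) = exp(-beta * |x - x0|^2)
xs = [0.1, 0.4, 0.5, 0.9]
x0, beta = 0.5, 4.0
K_vals = [math.exp(-beta * (x - x0) ** 2) for x in xs]
print(empirical_locality_norm(K_vals, correction=0.1))
```

When the vicinity of x₀ contains almost no training points, the truncation drives the bound to zero, which correctly signals that the normalized local risk cannot be controlled there.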
By combining inequalities (4.4) and (4.5), we obtain the following theorem, which provides a bound for the local risk functional in the case of pattern recognition.
Theorem 6. Let the VC-dimension of the set of indicator functions {Q(z, α), α ∈ Λ} be h₁. Let the VC-dimension of the set of real functions {K(x, x₀, β), β ∈ (0, ∞)} be h₂. The following inequality is simultaneously fulfilled for all α ∈ Λ and β ∈ (0, ∞), with probability 1 − η:
where
As expected, the VC-dimensions h₁ and h₂ affect the quantity E, which controls the second term of the sum. The VC-dimension h₂ of the set of locality functions {K(x, x₀, β), β ∈ (0, ∞)}, however, also affects the first term of the sum, which is the empirical estimate of the local risk functional. Therefore, it seems extremely advisable to use monotonic radial basis functions for defining the vicinity of a point x₀. In fact, the VC-dimension of the set of radial basis functions

{K(x, x₀, β) = K_β(‖x − x₀‖), β ∈ (0, ∞)}

where the K_β(r) are monotonically decreasing functions of r, is equal to 1.

5 Bounds of the Local Risk in Regression Estimation
In this section we apply the results presented in Section 3 to the problem of local regression estimation. The loss functions Q(z, α) are now real functions. In the case of pattern recognition, the loss functions were indicator functions, and we proved that the VC-dimension of the set {Q(z, α)K(x, x₀, β), α ∈ Λ, β ∈ (0, ∞)} does not exceed the sum of the VC-dimensions of the sets of functions {Q(z, α), α ∈ Λ} and {K(x, x₀, β), β ∈ (0, ∞)}.
This is no longer true in the case of real loss functions. For example, let {Q(z, α), α ∈ Λ} be a set of monotonically increasing functions, and {K(x, x₀, β), β ∈ (0, ∞)} be a set of monotonically decreasing functions. Although the VC-dimension of both sets is 1, the VC-dimension of the product of these sets is infinite. To apply the uniform convergence results, we will assume that the VC-dimension h of the set of functions {Q(z, α)K(x, x₀, β), α ∈ Λ, β ∈ (0, ∞)} is finite. We also assume that the functions Q(z, α) are nonnegative and satisfy condition (3.12). From inequality (3.13) we derive the following inequality, which is simultaneously valid for all α ∈ Λ, β ∈ (0, ∞), with probability 1 − η/2.
where

ε = τ √( (h[ln(2ℓ/h) + 1] − ln(η/24)) / ℓ ) / ‖K(x₀, β)‖   (5.2)
In Section 4, we proved that inequality (4.6) is true. Using (4.6) and (5.1), we obtain the following result:

Theorem 7. Let the VC-dimension of the set of nonnegative real functions {Q(z, α)K(x, x₀, β), α ∈ Λ, β ∈ (0, ∞)} be h. Let the VC-dimension of the set of locality functions {K(x, x₀, β), β ∈ (0, ∞)} be h₂. The following inequality is simultaneously valid for all α ∈ Λ, β ∈ (0, ∞), with probability 1 − η,
(5.3)
where
ε = τ √( (h[ln(2ℓ/h) + 1] − ln(η/24)) / ℓ ) / 𝒦(x₀, β)

and 𝒦(x₀, β) is defined in (4.6). This result provides a bound on the local risk functional for the case of regression estimation.
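For intuition, here is a minimal sketch of empirical local risk minimization for regression with squared loss and a Gaussian locality function. This is our own example, not code from the paper: when the predictors are restricted to constants, the minimizer of the normalized empirical local risk is simply the kernel-weighted mean of the targets.

```python
import math

def local_constant_fit(data, x0, beta):
    """Minimize the normalized empirical local risk for squared loss over
    constant predictions c:
        sum_i K(x_i,x0,beta)*(y_i - c)^2 / sum_i K(x_i,x0,beta).
    The minimizer is the kernel-weighted mean of the y_i (the
    Nadaraya-Watson estimate evaluated at x0)."""
    w = [math.exp(-beta * (x - x0) ** 2) for x, _ in data]
    return sum(wi * y for wi, (_, y) in zip(w, data)) / sum(w)

data = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]
# A sharply local estimate (large beta) tracks the nearby target value,
# while beta = 0 recovers the global (non-local) mean:
print(local_constant_fit(data, x0=0.5, beta=50.0))
print(local_constant_fit(data, x0=0.5, beta=0.0))
```

The locality parameter β plays exactly the role analyzed above: larger β lowers the empirical local risk near x₀ but shrinks 𝒦(x₀, β), loosening the guarantee.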
6 Local Structural Risk Minimization
We can now formulate the principle of local structural risk minimization, using the bounds provided by Theorems 6 and 7. In this section, the local structural risk minimization (LSRM) principle is formulated for pattern recognition. The regression case is essentially similar. Let us consider a nested structure on the set of indicator functions {Q(z, α), α ∈ Λ}:
S₁ ⊂ S₂ ⊂ ⋯ ⊂ S_n = {Q(z, α), α ∈ Λ}   (6.1)
Let the VC-dimension of each subset S_p be h₁(p), with

h₁(1) < h₁(2) < ⋯ < h₁(n)

We have proved, in Section 4, that the VC-dimension of the set of functions
{Q(z, α)K(x, x₀, β), α ∈ Λ_p, β ∈ (0, ∞)}

is smaller than h₁(p) + h₂, where h₂ denotes the VC-dimension of the set of real functions {K(x, x₀, β), β ∈ (0, ∞)}. According to Theorem 6, the following inequality is simultaneously valid for all elements S_p of the structure, with probability 1 − η:
Principle. The local structural risk minimization principle consists in choosing the element of structure S_p and the parameters α ∈ Λ_p and β ∈ (0, ∞) that minimize the guaranteed risk, as defined by the right-hand side of inequality (6.2). The various constants in bound (6.2) are the result of technical properties of the bounding derivations. Their "proven" values are irrelevant to practical problems. Therefore, it is advisable to design experiments to measure these constants, and to use these measured values instead of using the "proven" values.
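The selection step can be sketched schematically. This is an illustration only: the confidence term reuses the functional form of (5.2) with schematic constants, which, as recommended above, should in practice be measured rather than taken from the proofs.

```python
import math

def guaranteed_risk(emp_risk, l, h1, h2, eta=0.05):
    """Empirical local risk plus a VC confidence term for the product
    family, whose VC-dimension is at most h1 + h2 (constants schematic)."""
    h = h1 + h2
    return emp_risk + math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 24)) / l)

# Nested structure S_1 c S_2 c ... : pairs of (h1(p), empirical risk
# achieved by the best function found in element S_p).  Richer elements
# fit better but pay a larger confidence penalty.
l, h2 = 500, 1
candidates = [(3, 0.20), (8, 0.11), (20, 0.10), (50, 0.09)]
best = min(candidates, key=lambda c: guaranteed_risk(c[1], l, c[0], h2))
print(best)
```

The minimizer of the guaranteed risk is typically an intermediate element of the structure: the empirical risk keeps falling with h₁(p), but the confidence term eventually dominates.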
Appendix 1: Proof of Theorem 3

Using Lebesgue sums, we can write:
where ν{Q(z, α) > Bn/N} denotes the frequency of the event {z : Q(z, α) > Bn/N} obtained on the basis of the sample z₁, …, z_ℓ. Then
Using Theorem 1 and this inequality, we obtain
where h is the VC-dimension of the set of indicator functions
According to Definition 2, this quantity is the VC-dimension of the set of real loss functions {Q(z, α), α ∈ Λ}. Theorem 3 is thus proven. □
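The Lebesgue-sum expansion used above can be checked numerically: for nonnegative values bounded by B, the sums converge to the ordinary mean as N grows. A small sketch with hypothetical data:

```python
def lebesgue_sum(values, B, N):
    """Approximate the mean of nonnegative values bounded by B via the
    Lebesgue sums of Appendix 1: (B/N) * sum_n freq{value > B*n/N},
    where freq is the empirical frequency over the sample."""
    l = len(values)
    return (B / N) * sum(
        sum(1 for v in values if v > B * n / N) / l for n in range(1, N + 1)
    )

vals = [0.2, 0.5, 0.9, 0.4]
print(lebesgue_sum(vals, B=1.0, N=100000))   # approaches the plain mean
```

This is the identity that lets the proof transfer uniform convergence of frequencies of the level-set events {Q(z, α) > c} to uniform convergence of means of the real-valued losses.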
Appendix 2: Proof of Theorem 4

Again, consider a set of real functions {Q(z, α), α ∈ Λ} of VC-dimension h, and assume 0 ≤ Q(z, α) ≤ B. The following result is proven in Vapnik (1982, p. 197, Lemma).
Using the Cauchy inequality, we can write

We replace this result in inequality (6.1); we bound (2ℓ)^h/h! by the more convenient expression (2ℓe/h)^h; and obtain
Theorem 4 is thus proven. □

Acknowledgments

We thank the members of the Neural Network research group at Bell Labs, Holmdel, for useful discussions. S. Solla and C. Cortes provided help to render this article more clear.

References

Bottou, L., and Vapnik, V. 1992. Local learning algorithms. Neural Comp. 4(6), 888-901.

Vapnik, V. 1982. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York.

Vapnik, V. 1992. Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, D. S. Touretzky, ed., Vol. 4, pp. 831-839. Morgan Kaufmann, San Mateo, CA.

Received 14 July 1992; accepted 4 March 1993.
Communicated by Shun-ichi Amari and Halbert White
On the Geometry of Feedforward Neural Network Error Surfaces

An Mei Chen, Haw-minn Lu
University of California, San Diego, CA, USA
Robert Hecht-Nielsen
HNC, Inc. and University of California, San Diego, CA, USA
Many feedforward neural network architectures have the property that their overall input-output function is unchanged by certain weight permutations and sign flips. In this paper, the geometric structure of these equioutput weight space transformations is explored for the case of multilayer perceptron networks with tanh activation functions (similar results hold for many other types of neural networks). It is shown that these transformations form an algebraic group isomorphic to a direct product of Weyl groups. Results concerning the root spaces of the Lie algebras associated with these Weyl groups are then used to derive sets of simple equations for minimal sufficient search sets in weight space. These sets, which take the geometric forms of a wedge and a cone, occupy only a minute fraction of the volume of weight space. A separate analysis shows that large numbers of copies of a network performance function optimum weight vector are created by the action of the equioutput transformation group and that these copies all lie on the same sphere. Some implications of these results for learning are discussed.

1 Introduction
For the sake of concreteness, we will concentrate in this paper on the "multilayer perceptron" or "backpropagation" feedforward neural network architecture (Rumelhart et al. 1986; Hecht-Nielsen 1992). However, many of the results we present can be reformulated to apply to other neural network architectures as well [e.g., the radial basis function networks of Reilly et al. (1982), Broomhead and Lowe (1988), Moody and Darken (1989), and Poggio and Girosi (1990); the ART networks of Carpenter and Grossberg (1991); counterpropagation networks (Hecht-Nielsen

Neural Computation 5, 910-927 (1993) © 1993 Massachusetts Institute of Technology
1991); and the mutual information preserving networks of Linsker (1988) and Becker and Hinton (1992)]. The layers of the neural networks we consider in this paper are assumed to have units with transfer functions of the form

z_li = s(I_li),   I_li = Σ_{j=0}^{M_{l−1}} w_lij z_{(l−1)j}

for l > 1, where

s(u) = tanh(u) for layers 2 through K − 1, and s(u) = u for layer K
K = number of layers in the network (including input and output layers)
l = layer number (1 through K)
M_l = number of units on layer l, assumed to be > 1
x_i = ith component of the external input vector x, 1 ≤ i ≤ n
y_j = jth component of the network output vector y′, 1 ≤ j ≤ m
z_l0 = 1.0 (the bias input to each unit)
w_lij = weight of unit i of layer l associated with input z_{(l−1)j} from layer l − 1

Each layer in this architecture receives inputs from all of the units of the previous layer, but none from any other layer. (Note: all of our results either remain the same or easily generalize when connections skip layers, but the mathematical notation becomes messy. Thus, only the simplest case is presented here.) The network weight vector of a multilayer perceptron neural network is the q-dimensional real Euclidean vector w whose components consist of all of the weights of the network in some fixed order. We shall refer to the space of all such weight vectors (namely, R^q) as weight space and denote it by W. Clearly, the network weight vector determines the input-output transfer function of the network.
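In code, the architecture just described can be sketched as follows. This is a minimal illustration; packing the weights into per-layer matrices with bias-first rows is our own convention for ordering the components of the weight vector w.

```python
import math

def forward(x, layers):
    """Forward pass of the multilayer perceptron described above: each
    unit computes s(sum_j w_lij * z_(l-1)j) with z_(l-1)0 = 1.0 as the
    bias input; s = tanh on hidden layers, s = identity on layer K."""
    z = list(x)
    for l, W in enumerate(layers):
        last = (l == len(layers) - 1)
        z_in = [1.0] + z                        # prepend bias input z0 = 1
        z = [sum(w * zi for w, zi in zip(row, z_in)) for row in W]
        if not last:
            z = [math.tanh(u) for u in z]       # tanh on hidden layers only
    return z

# Tiny 2-3-1 network; each row of a layer matrix is one unit's weight
# vector (bias first), i.e., one block of the network weight vector w.
hidden = [[0.1, 1.0, -1.0], [0.0, 0.5, 0.5], [-0.2, 2.0, 0.3]]
output = [[0.3, 1.0, -1.0, 0.5]]
print(forward([0.7, -0.2], [hidden, output]))
```

Concatenating the rows of all layer matrices in a fixed order yields the q-dimensional weight vector w; the function computed by `forward` depends on nothing else.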
For the purposes of this paper, we shall assume that each multilayer perceptron neural network under consideration is being used to approximate a fixed, square integrable (i.e., L₂) function

f : A ⊂ Rⁿ → R^m

where A is a compact subset of Rⁿ. Further, we shall assume that the performance of the network is being measured by some performance function F(w) that depends only on the network's input-output transfer function (which depends only on the selection of the network weight vector w) and on the manner in which the x vectors in A are chosen (which is assumed to be in accordance with a fixed scheme: namely, random selection with respect to a fixed smooth, i.e., C^∞, probability density function ρ(x) such that the elements of a random vector with this density are linearly independent with probability one). It suffices (but is not necessary) that the covariance matrix associated with ρ exist and be finite and nonsingular. Note that this method of choosing the input vectors ensures that they will not have a fixed linear relationship with one another, which could introduce undesired symmetries into the weights of the first hidden layer. Given a network performance function F(w), we can view F as a surface hovering over the weight space W with its altitude at each point w determined by the performance of the network with that selection of weight values. Given a multilayer perceptron network, a function f : A ⊂ Rⁿ → R^m to approximate, a probability density function ρ(x), and a particular performance function F(w), we shall refer to such a surface as the performance surface of the network. These assumptions about the network performance function are very mild. For example, functions as diverse as mean squared error, median of squared errors, and the supremum of errors, namely
F_m(w) = lim_{N→∞} median[ |f(x₁) − y′(x₁, w)|², …, |f(x_N) − y′(x_N, w)|² ]

F_s(w) = sup_{x ∈ A, ρ(x) > 0} |f(x) − y′(x, w)|

are accommodated within the definition, where y′(x, w) is the output of the multilayer perceptron network, which is an approximation of the desired output of the network y = f(x). The key element of the performance function definition is its dependence only on the input-output transfer function of the network. This allows the network performance to be evaluated not only in terms of
just the errors it makes, but also, if desired, in terms of other factors, such as the curvature of its "approximating surface" (as determined by functions of derivatives of y′(x, w) with respect to the components of the input vector x), as in the networks of Bishop (1991, 1990) and Poggio and Girosi (1990). However, explicit dependence of the performance function on factors that are not determined by the input-output behavior of the network, such as direct dependence on the number of hidden units, is not allowed by this definition. The main focus of this paper is the study of geometric transformations of weight space that have the property that they leave the input-output transformation of the neural network unchanged. Obviously, such transformations will also leave all network performance functions unchanged. We begin in Section 2 by showing that all such equioutput transformations are compositions of two simple classes of isometries. Following this, we show that the set of all equioutput transformations forms an algebraic group of surprisingly large order. The fact that there exists a large group of equioutput transformations in weight space implies that performance surfaces are highly redundant, since each network weight vector is equivalent (in terms of determining the same network input-output transfer function and performance) to a multitude of other weight vectors. Thus, if we are searching for an optimum of a performance function, it would seem to be possible, at least in principle, to make our search more efficient by confining it to a small subset of weight space in which all redundancy has been eliminated. In Section 3 we proceed to further analyze our equioutput transformation group by showing that it is isomorphic to a direct product of Weyl groups.
In Section 4 we then exploit known facts about these Weyl groups and the root spaces of their associated Lie algebras to derive a set of simple inequalities that defines nonredundant search sets having the geometric forms of a wedge and a cone. These minimal sufficient search sets occupy only a minute fraction of the volume of weight space and contain no two equivalent weight vectors, while containing weight vectors equivalent to every weight vector in the space. In Section 5 we consider yet another implication of our transformation group: that each weight vector that optimizes the network performance function is equivalent to many other such optima, and that all of these lie on the same sphere. Finally, in Section 6 we consider the implications of the results of Sections 2, 3, 4, and 5 for neural network learning.

2 Equioutput Transformations
We begin by studying the properties of weight space transformations that leave the network input-output transfer function unchanged. These transformations are now defined.
Definition 1. An equioutput transformation is an analytic (i.e., continuous and expandable in a power series around any point) mapping g : W → W from weight space to weight space that leaves the output of the neural network unchanged. In other words,
y′(x, g(w)) = y′(x, w) for all x ∈ Rⁿ and all w ∈ W.

First, consider two types of equioutput transformations: hidden unit weight interchanges and hidden unit weight sign flips. For simplicity, we will refer to these transformations as interchanges and sign flips. An interchange consists of a network weight vector component permutation in which the weight vectors of two hidden units on the same hidden layer are simply interchanged without changing the orderings of the weights within the units. (Note: the term unit weight vector refers to the vector with components equal to the weights within a single unit.) A compensatory interchange of the weights of the next layer units that receive the inputs from the two interchanged units then removes the effect on the network output of the exchange of weights in the previous layer (see Fig. 1). The other type of equioutput transformation is where the weight vector of a hidden layer unit is multiplied by −1 (resulting in a sign flip of the output of the unit, since tanh is an odd function). A compensatory sign flip is then carried out on all of the weights of units of the next layer associated with the input from the sign-flipped unit output (see Fig. 2). We now show that:

Theorem 1. All equioutput transformations of W to W are compositions of interchange and sign flip transformations.

Proof. By induction. Let g be any equioutput transformation. We first note that, since y′ is the output of the network for input x, we have

y′_i(x, w) = Σ_{j=0}^{M_{K−1}} w_Kij z_{(K−1)j}(x, w) = Σ_{j=0}^{M_{K−1}} g(w)_Kij z_{(K−1)j}(x, g(w)) = y′_i(x, g(w))   (2.1)
where y′_i(x, w) is the output of the original network, y′_i(x, g(w)) is the output of the network with weight vector g(w) (the transformed network), and where z_{(K−1)j}(x, w) and z_{(K−1)j}(x, g(w)) are, respectively, the outputs of the jth units of the last hidden layers of the original and transformed networks. The first step is to take the second partial derivatives of both sums of equations 2.1 with respect to the output layer weights w_Kij and w_Kuv. Note
Figure 1: Interchange transformations involve interchanging the weight vectors of two units on a hidden layer (which has the same effect as interchanging the units themselves). The weights within the units of the next layer that act upon the inputs from the interchanged units are themselves interchanged. The net result is no change in the outputs of these next-layer units.

that the first partial of the first sum with respect to w_Kij is equal to z_{(K−1)j}(x, w) and the second partial of this sum is zero, because the output of the last hidden layer of the original network has no dependence on the output layer weights. Thus, we get
(2.2)
for the second partial derivative of the second sum. If we write out the mathematical forms of the four sums of equation 2.2 we see that each nonzero term in each is a transcendental form
Figure 2: In a sign flip transformation, the signs of all of the weights of a single unit on a hidden layer are changed. The signs of the weights acting on this unit's output in all of the units on the next layer are also flipped. Again, as with interchange transformations, the outputs of the next-layer units are unchanged.

that will be, by inspection, in general (i.e., for all but a set of weights of measure zero), linearly independent with respect to all of the other terms as the network input x is varied (except for pairs of terms in the middle two sums that have matching values of l). Thus, except for these pairs, each term in these sums must itself be identically zero. The bad sets of weights need not concern us because we can analytically continue into them. Consider the fourth sum first. In this sum the g(w)_Kil are not zero in general, so we must have

∂²z_{(K−1)l}(x, g(w)) / ∂w_Kuv ∂w_Kij = 0
for all u, v, i, j, and l. Therefore,

(2.3)

where w̄ is the vector w with all of the output layer components w_Kij removed. In the first sum of equation 2.2 the z_{(K−1)l}(x, g(w))s are clearly
nonzero in general, so we must have
for all u, v, i, j, and l (we will see below why this is true). Finally, we note that not all of the ∂g(w)_Kil/∂w_Kuv can be zero, for if they were, then the output weight g(w)_Kil would not depend on any of the output weights of the original network. This cannot be, since then there would be no way to set all of the outputs of the transformed network to zero for all x inputs (which we can do by simply setting all of the output weights of the original network to zero). Thus, we conclude that ∂z_{(K−1)l}(x, g(w))/∂w_Kij must be zero. Thus, the weights of the hidden layers of the network do not depend upon the output layer weights [i.e., all of the b_lv(x, w̄) of equation 2.3 must be zero]. We now explore the relationship between the output layer weights w_Kij and g(w)_Kij of the original and transformed networks. To do this we expand both z_{(K−1)j}(x, w) and z_{(K−1)j}(x, g(w)) as power series in x and substitute these into equations 2.1. These expansions are given by
z_{(K−1)j}(x, w) = a_{1j} + xᵀb_{1j} + (1/2) xᵀC_{1j}x + ⋯

and

z_{(K−1)j}(x, g(w)) = a_{2j} + xᵀb_{2j} + (1/2) xᵀC_{2j}x + ⋯
When we substitute these quantities into the sums of equations 2.1 we note that, since these sums are equal for all values of x, all of the coefficients in these power series expansions must be equal as well. Thus, we get an infinite set of linear equations

Σ_{j=0}^{M_{K−1}} w_Kij a_{1j} = Σ_{j=0}^{M_{K−1}} g(w)_Kij a_{2j}
and so on. Since the coefficients in the multidimensional Taylor series for z_{(K−1)j}(x, w) and z_{(K−1)j}(x, g(w)) are set by controlling the nonoutput layer weights of the network, and since the functional forms achievable by setting these nonoutput layer weights are a rich set of functions [see Sussmann (1992) for a discussion of this property], these equations are, in
general, not linearly dependent [note that as the coefficients are changed, the w_Kij s and g(w)_Kij s remain fixed]. Thus, each g(w)_Kij must depend linearly on the w_Kij s. So, we can write each g(w)_Kij as a linear combination of the w_Kij s, and vice versa, with coefficients that themselves cannot be functions of the output layer weights. Thus, we can write
Substituting equation 2.4 into equation 2.1 and taking the partial derivative with respect to w_Kij then gives

(2.5)

Note that if we set all of the output weights but w_Kij to zero (without changing any of the other weights in the network), equations 2.4 and 2.5 imply that exactly one of the d(w)_iuj must equal either +1 or −1 for each fixed i and j. The others must be zero. This follows because neither the d(w)_iuj s nor the z_{(K−1)u}(x, g(w))s can be functions of the output weights. Finally, if we substitute equations 2.4 and 2.5 into equation 2.1 we get
From this we see that d(w)_iuj = d(w)_kuj for all i and k. Thus, the d values for each output unit are the same. Thus, the only possible equioutput transformations are those that have the effect of sign flips and interchanges on the last hidden layer. That this is true for all of the other hidden layers (if any) is clear, since we can simply "clip off" the output layer of the network and invert the tanh functions of the last hidden layer
outputs to create another multilayer perceptron. Applying the above argument to this network then shows that all equioutput transformations act as compositions of interchanges and sign flips on the second-to-last hidden layer, and so on. □

We believe, but cannot prove, that the above theorem would hold even if we only demanded continuity (and not analyticity) of our equioutput transformations. In addition to our analytic equioutput transformations, there exist discontinuous conditional equioutput transformations as well. For example, if all of the weights acting on the output of a hidden unit are zero, then the weights of that unit can take on any values without altering the output of the network. So, this unit's transformed weights might, for example, be set to constant values under this condition, yielding a discontinuous equioutput transformation. Sussmann (1992) has studied and cataloged some of these situations (he has also established a result closely related to Theorem 1). These discontinuous transformations may be worthy of further investigation, as they establish the existence of affine subspace "generators" at at least some points on performance surfaces. If it could be shown that all performance surfaces are generated (like the cylinder (x²/a²) + (y²/a²) = 1 or the hyperboloid (x²/a²) + (y²/a²) − (z²/b²) = 1 in three-dimensional space can be generated by moving a line along a trajectory and correctly controlling the attitude of the line as it moves), this might provide a new line of attack for understanding the geometric structure of such surfaces. Whether performance surfaces will, in general, turn out to be generated is unclear.

3 The Structure of Group G
In this section we show that the set of all equioutput transformations forms a group. Then we analyze the structure of this group.

Theorem 2. The set of all equioutput transformations on W forms a non-Abelian group G of order #G, with

#G = ∏_{l=2}^{K−1} (M_l!)(2^{M_l})

Proof. We first note that the set of interchange transformations involving the interchange of unit weight vectors on hidden layer l is in one-to-one correspondence with the set of all permutations of M_l elements. Thus, there are (M_l!) different interchange transformations for layer l. The number of sign flips is the same as the number of binary numbers with M_l bits, or 2^{M_l}. It is easy to show that the interchange and sign flip transformations of one layer commute with those of any other layer. Thus, they are independent, and the numbers of transformations on different layers are multiplied to obtain the order of the group. Finally, the set of all such transformations forms a group because, first, it is
a subset of the finite symmetry group of a coordinate-axis-aligned cube centered at the origin in weight space, and second, because it is closed under composition (i.e., the product of any two transformations in the set is another transformation in the set). □

Thus, the set of all weight space transformations that leave the network input-output function unchanged forms an algebraic group G. We now analyze the structure of this group, beginning with a definition.

Definition 2. The group O_k is the set of transformations on the vector space R^k generated by changing the sign of any one coordinate component and by interchanging any two coordinate components.

The O notation is used because O_k is isomorphic to the symmetry group of a cube in k-dimensional space. An important fact about O_k is that it is the Weyl group of the classical Lie algebra B_k (Humphreys 1972; Helgason 1962; Weyl 1946). For the results presented here the most important property of the group O_k is the fact that it can be represented as the set of reflections generated by the roots of the specific Lie algebra B_k with which it is associated. From Weyl group theory (Humphreys 1972), the order of the group O_k is k! 2^k, which is exactly the size of the set of interchange and sign flip equioutput transformations for a hidden layer with k units. Thus, G might be constructible from Weyl groups. We now show that this suspicion is correct.

Theorem 3. The group G is isomorphic to O_{M_2} × O_{M_3} × ⋯ × O_{M_{K−1}}.

Proof. Write the weight space W as
B₂ × U₂ × B₃ × U₃ × ⋯ × B_K × U_K,

where B_l is the subspace of bias weights of layer l, and U_l is the subspace of nonbias weights of layer l. The group action can then be expressed for each hidden layer as the direct operation of the cube symmetry group O_{M_l} on each subspace B_l and as the indirect, but isomorphic, operation of nonbias weight interchanges and sign flips on the subspaces U_l and U_{l+1}. Only the hidden layers have symmetry groups associated with them, since the input and output layer units cannot be permuted. Thus, each hidden layer contributes exactly one cube symmetry group to the overall group action. The group is isomorphic to the direct product of these groups because the actions of the individual groups operating on different layers commute. □

4 Search Sets
In this section we consider minimal sufficient search sets for multilayer perceptron neural networks. First, we provide some definitions that will be needed.
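Before defining search sets formally, it may help to see the underlying equivalence concretely: applying an interchange or a sign flip, together with its compensatory change in the next layer, leaves the network output unchanged. A self-contained sketch for a hypothetical 2-2-1 tanh network (bias-first weight rows, our own packing convention):

```python
import math

def forward(x, hidden, output):
    """One-hidden-layer tanh network; rows are unit weight vectors (bias first)."""
    z = [math.tanh(row[0] + sum(w * xi for w, xi in zip(row[1:], x)))
         for row in hidden]
    return [row[0] + sum(w * zi for w, zi in zip(row[1:], z)) for row in output]

hidden = [[0.3, 1.0, -2.0], [-0.5, 0.7, 0.2]]
output = [[0.1, 0.8, -1.3]]

# Interchange: swap hidden units 0 and 1, and swap the output weights
# acting on their outputs (bias of the output unit is untouched).
h_swap = [hidden[1], hidden[0]]
o_swap = [[output[0][0], output[0][2], output[0][1]]]

# Sign flip: negate all weights of hidden unit 0 (tanh is odd, so its
# output negates) and negate the output weight acting on that output.
h_flip = [[-w for w in hidden[0]], hidden[1]]
o_flip = [[output[0][0], -output[0][1], output[0][2]]]

x = [0.4, -0.9]
y = forward(x, hidden, output)
assert all(abs(a - b) < 1e-12 for a, b in zip(y, forward(x, h_swap, o_swap)))
assert all(abs(a - b) < 1e-12 for a, b in zip(y, forward(x, h_flip, o_flip)))
print("equioutput transformations leave the output unchanged")
```

Every weight vector therefore has many equivalent copies, which is exactly the redundancy that a minimal sufficient search set removes.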
Definition 3. Two network weight vectors u and v in W are equivalent iff there exists g ∈ G such that g(u) = v.

Definition 4. A minimal sufficient search set is a subset S of W such that each w in W is equivalent to exactly one element in S.

Definition 5. An open minimal sufficient search set is the interior of a minimal sufficient search set whose closure is equal to the closure of its interior.

Previous work by Hecht-Nielsen (1990) and Hecht-Nielsen and Evans (1991) demonstrated that there exist reasonably small, but nonminimal, sufficient search sets in the form of a cone bounded by hyperplanes. We now improve on this earlier work by showing that there exist open minimal sufficient search sets in the geometric forms of a wedge and a cone bounded by hyperplanes. Further, we supply formulas for these sets in terms of simple linear and quadratic inequalities, respectively.

Theorem 4. The wedge interior described by the inequalities
Proof. We construct the wedge by piecing together the Weyl chambers in the subspaces B_l of W. The cone is then constructed from the wedge. First, to simplify the notation, we define U ≡ U₂ × U₃ × ⋯ × U_K so we can rewrite our decomposition of weight space as W = B₂ × B₃ × ⋯ × B_K × U.
To begin our proof we observe that, since the Weyl group O_{M_l} acts directly on B_l, B_l can be identified with the root space of the classical Lie algebra B_{M_l} (Varadarajan 1984). This identification is unique because this particular Weyl group acts directly only on the root space of this one Lie algebra. An open minimal search set for the action of O_{M_l} on the root space of B_{M_l} is an open convex subset of the root space known as a Weyl chamber (Varadarajan 1984; Humphreys 1972; Helgason 1962). We will use D_l to denote the corresponding subset of B_l. To proceed, we shall need the following technical results concerning compositions of a group G₁ directly acting on space V₁ with open minimal search set S₁, and a group G₂ directly acting on space V₂ with open minimal search set S₂. In particular:

1. Let G₁ × G₂ act on V₁ × V₂ coordinatewise. Then S₁ × S₂ is an open minimal search set of V₁ × V₂ under G₁ × G₂.
2. If G₁ = G₂ = G and g ∈ G acts on (v₁, v₂) by g(v₁, v₂) = (gv₁, gv₂), then S₁ × V₂ is an open minimal search set for V₁ × V₂ under G₁.
The proofs, which are elementary, are omitted. By applying Result 1 successively to B₂ × B₃, then to (B₂ × B₃) × B₄, and so on, we see that D₂ × D₃ × ⋯ × D_{K−1} is an open minimal sufficient search set for B₂ × B₃ × ⋯ × B_{K−1}. Applying Result 2 to (B₂ × B₃ × ⋯ × B_{K−1}) × (B_K × U) then shows that D₂ × D₃ × ⋯ × D_{K−1} × B_K × U is a minimal sufficient search set for W. Having characterized an open minimal sufficient search set for W, we now use the fact from Lie algebra theory that a Weyl chamber in the root space of B_k is determined by the inequalities α · w > 0, where α is the Riesz representation vector for a positive root of the algebra (Varadarajan 1984; Humphreys 1972; Helgason 1962; Bachman and Narici 1966). For the Lie algebra B_k there are k² positive roots of the form ê_i and ê_i ± ê_j, where the ê_i are basis vectors in root space. We choose the roots ê_i and ê_i ± ê_j, with i < j, to be positive. Identifying these basis vectors with the hidden layer bias weight space positive coordinate axes gives us the following three sets of inequalities
w_{li0} > 0                 for 1 ≤ i ≤ M_l,  2 ≤ l ≤ K−1
w_{li0} − w_{lj0} > 0       for 1 ≤ i < j ≤ M_l,  2 ≤ l ≤ K−1
w_{li0} + w_{lj0} > 0       for 1 ≤ i < j ≤ M_l,  2 ≤ l ≤ K−1
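These conditions can be sanity-checked numerically. The sketch below (the function names and the random test are ours, not the paper's) also confirms that, for a single hidden layer's bias weights, the full set of conditions is equivalent to "all bias weights positive and strictly decreasing":

```python
import itertools
import random

def in_wedge_raw(b):
    """Check the three raw inequality sets for one hidden layer's
    bias weights b = (w_l10, ..., w_lM0)."""
    pos  = all(x > 0 for x in b)                                   # w_li0 > 0
    diff = all(b[i] - b[j] > 0
               for i, j in itertools.combinations(range(len(b)), 2))
    summ = all(b[i] + b[j] > 0
               for i, j in itertools.combinations(range(len(b)), 2))
    return pos and diff and summ

def in_wedge_reduced(b):
    """Reduced form: all positive and strictly decreasing left to right."""
    return (all(x > 0 for x in b)
            and all(b[i] > b[i + 1] for i in range(len(b) - 1)))

# The two descriptions agree on random bias vectors.
random.seed(0)
for _ in range(1000):
    b = [random.uniform(-1.0, 1.0) for _ in range(4)]
    assert in_wedge_raw(b) == in_wedge_reduced(b)
```

The equivalence holds because positivity makes the sum inequalities automatic, and the pairwise difference inequalities for all i < j reduce to the adjacent ones by transitivity.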
However, the last set of inequalities is redundant, given the first set. Also, some of the inequalities in the second set are redundant. For example, if w_{li0} > w_{lj0} and w_{lj0} > w_{lk0}, then there is no need for the condition that w_{li0} > w_{lk0}. Thus, we are left with inequalities 4.1 describing an open minimal sufficient search set in the form of a wedge. Note that this is a wedge because any positive real multiple of a member of the set is also a member of the set (as opposed to a cone, for which any real multiple must also be in the set). The cone interior described by inequalities 4.2 is constructed by simply breaking the wedge across the hyperplane through the origin perpendicular to the bias weight axis of the first output layer unit, throwing away the portion of the wedge that intersects this hyperplane, and then rotating the bottom (negative half-space) half of the broken wedge by 180°. We can do this since the bias weights of the output units are unaffected by the group. □ Note that the wedge inequalities can be summarized by the simple statement that the weight vectors within (in the interior of) the fundamental wedge have all of their hidden layer bias weights positive and that the bias weight of each hidden unit is larger than the bias weight of the unit directly to its right. A similar statement holds for the cone. Also note that to turn these open minimal sufficient search sets into minimal sufficient search sets, we would have to add certain (but not all)
points on their boundaries. For example, points with w_{li0} = w_{lj0} for all units of each hidden layer would have to be added. Thus, for practical applications, we might simply want to use ≥ inequalities in 4.1 and 4.2. Note that the images of a minimal sufficient search set S under different transformations in G are, except for certain points on their boundaries, disjoint. The entire weight space thus consists of the union of these sets
W = ∪_{g∈G} g[S]
As a result of this fact, and of the fact that the equioutput transformations themselves (namely, the elements of G) preserve not only the output of the network for all inputs but, thereby, the value of the network performance function as well, the network performance function is uniquely determined everywhere by its behavior on a minimal sufficient search set. Unfortunately, the manner in which the behavior of a performance function in a wedge copy is determined from its behavior within the fundamental wedge is not simple, since the hyperplanes that bound the wedge are not planes of symmetry for the transformations of G. That they are not is easy to see: if w and w′ are weight vectors that have all of their components equal except for one hidden unit bias weight differing in sign, or the bias weights of two adjacent hidden units interchanged (i.e., points placed symmetrically with respect to one of the bounding hyperplanes of the fundamental wedge), then, in general, there will be no transformation g ∈ G such that g(w) equals w′ (since the other weights involved in a sign flip or interchange transformation are not properly modified by this hyperplane reflection). Thus, in general, F[g(w)] will not equal F(w′). Understanding the relationship between the symmetries of G and the geometry of the fundamental wedge (or other minimal sufficient search sets) would seem to be an issue worthy of further investigation. In this section we have examined one ramification of the geometric structure of weight space: namely, the fact that searches for optimum weight vector values can be confined to an extremely small portion of the weight space. In the next section we consider another ramification of the group G.

5 Spherical Throngs of Optima
Another fact about the transformations in the group G is that they are isometries. Thus, for any w ∈ W and any g ∈ G,

|g(w)| = |w|

That this is so is easy to see, because the only effect of applying any combination of sign flips and interchanges is to change the signs of and permute the components of the weight vector, neither of which affects the Euclidean length of the vector. Given that the elements of G are isometries, if w* is a finite weight vector that optimizes the network performance function, then the points obtained by applying the elements of G to w* all lie on the same sphere around the origin in weight space. In general, these points are all different and there are #G of them. We call such a set of points a spherical throng of optima. Note that the copies of an optimal weight vector w* in its spherical throng will, on average, have half of their weights modified from their corresponding values in w* by a transformation in G. This is easy to see, since half of the permutation and sign flip compositions change more than half of the weights and half change fewer than half. Thus, the copies of w* in its throng will, in general, be scattered all over the sphere. Of course, in rare cases (such as where all of the weights have the same value), these copies are not scattered widely; but in general they are. It is easy to see that, given any member w of the throng, the nearest neighbor member on the sphere will typically be a vector that is the same as w except for a single sign flip or interchange transformation. Thus, nearest neighbors in the throng will tend to have most of their components the same. The areal density of a spherical throng of optima depends upon the magnitude of w*, since the number of members of the throng is, in general, #G, which depends only on the architecture of the network. If this magnitude |w*| is small, then the density will be high. Extensive experience with multilayer perceptrons at HNC, Inc. has shown that the lengths of apparently near-optimal weight vectors rarely, if ever, exceed √q, where again q is the number of weights in the network.
In other words, the rms weight values in near-optimum weight vectors are typically considerably less than 1.0 in magnitude (exactly why this is so is unknown). Thus, it might be that these throngs are not only large, but often dense as well. 6 Implications
In this section we consider some implications of the results of Sections 2, 3, 4, and 5. The existence of simple formulas for minimal sufficient search sets raises the question of whether such formulas will be of use in learning. They probably will not. In the case of gradient descent learning they would not be of use, since if we are going downhill, and happen to cross the boundary of the minimal sufficient search set, we should just continue the descent. Even if we wanted to move to the corresponding point of the performance surface within the fundamental wedge we could not do so, since (as pointed out in Section 4) we do not yet have a formula for finding this point.
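The two facts used repeatedly above — that hidden-unit sign flips and interchanges preserve the network's input-output map, and that they are isometries of weight space — can be checked directly. A minimal sketch for a one-hidden-layer tanh network (all function and variable names are ours; for a single hidden layer of h units, the group generated by these operations has 2^h · h! elements):

```python
import math
import random

def forward(W, b, v, c, x):
    """One-hidden-layer tanh network: y = c + sum_i v[i] * tanh(W[i].x + b[i])."""
    return c + sum(vi * math.tanh(sum(wij * xj for wij, xj in zip(Wi, x)) + bi)
                   for Wi, bi, vi in zip(W, b, v))

def sign_flip(W, b, v, i):
    """Flip unit i: tanh is odd, so negating its in- and out-weights is equioutput."""
    W2 = [list(r) for r in W]; b2 = list(b); v2 = list(v)
    W2[i] = [-w for w in W2[i]]; b2[i] = -b2[i]; v2[i] = -v2[i]
    return W2, b2, v2

def interchange(W, b, v, i, j):
    """Swap hidden units i and j (a relabeling, hence equioutput)."""
    W2 = [list(r) for r in W]; b2 = list(b); v2 = list(v)
    W2[i], W2[j] = W2[j], W2[i]
    b2[i], b2[j] = b2[j], b2[i]
    v2[i], v2[j] = v2[j], v2[i]
    return W2, b2, v2

def norm(W, b, v, c):
    """Euclidean length of the full weight vector."""
    return math.sqrt(sum(w * w for r in W for w in r) + sum(t * t for t in b)
                     + sum(t * t for t in v) + c * c)

random.seed(1)
W = [[random.gauss(0, 1) for _ in range(3)] for _ in range(4)]
b = [random.gauss(0, 1) for _ in range(4)]
v = [random.gauss(0, 1) for _ in range(4)]
c = random.gauss(0, 1)

# Compose an interchange with a sign flip: still equioutput and norm-preserving.
W2, b2, v2 = sign_flip(*interchange(W, b, v, 1, 3), 0)
for _ in range(100):
    x = [random.gauss(0, 1) for _ in range(3)]
    assert abs(forward(W, b, v, c, x) - forward(W2, b2, v2, c, x)) < 1e-12
assert abs(norm(W, b, v, c) - norm(W2, b2, v2, c)) < 1e-12
```

Iterating such compositions over an optimal w* traces out exactly the spherical throng described above.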
For learning methods that employ stochastic jumping, rule-based weight modification, or another nongradient descent method, it might seem useful to constrain the hidden layer bias weights so as to force the network weight vector to remain within the fundamental wedge (or some other minimal sufficient search set). However, this is not really true, as the following example shows. Imagine a simple performance surface with one and only one finite minimum (located near, but not at, the origin) within the fundamental wedge. The goal is to find a weight vector within ε distance of this minimum. Suppose that a simple unconstrained discrete-time gaussian random weight space search were being used to find this minimum. Then there would appear to be a search speed-up of #G to be gained by constraining the search to an equivalent search process within a minimal sufficient search set. However, this is an illusion, because the unconstrained search process is not trying to find a single minimum (as the constrained process is); it need only find one of #G equivalent copies of the minimum. Therefore, both searches will have the same expected number of steps. Thus, we conclude that knowing the geometry of a minimal sufficient search set has no obvious benefit for learning. With respect to spherical throngs of optima, we speculate that gradient descent learning may be aided by the fact that most learning procedures follow the tradition of starting the weight vector components at random values chosen over a small interval centered at zero. Starting near the origin has more than one benefit. First, starting near the origin causes the "approximating surface" of the network to start out nearly "flat", with its initial output value near zero everywhere. As the training process proceeds, this initially flat approximating surface begins to "crinkle up" as it tries to fit the training data.
Thus, starting the weight values near zero provides a parsimonious surface of initially nearly zero curvature. Another reason why this tradition is so apt is that the usual activation functions tanh(x) and (1 + e^{−x})^{−1} have all of their significant behavior near zero. Thus, one would naturally expect that, if the inputs to the network tend to be small in magnitude, large weight values would be needed only rarely. As mentioned above, anecdotal evidence suggests that this is what occurs in many practical situations. Geometrically, when gradient descent training begins with an initial weight vector near the origin, we conjecture that the learning process consists of an initial, generally outward, movement from the origin to a radius at which a spherical throng of optima is encountered, followed by a "homing in" process that guides the weight vector toward an optimum. If this conjecture is correct, and if, as we suspect, many practical problems have a performance surface with a spherical throng of optima located at a relatively small radius, then this dense shell of optima may be a relatively "easy" target to hit. In other words, in contrast to the typical optimization situation (e.g., in linear programming, combinatorial
optimization, or unconstrained optimization), where we are searching a high-dimensional space for a single optimum (or one of a small number of optima), here we are searching a high-dimensional space for any one of a vast multitude of optima. This may partially explain why the training of multilayer perceptron networks with thousands or even millions of adaptive weights is often practically feasible (a fact that we now take for granted, but which, a priori, is rather surprising).

Acknowledgments

We thank the referees for several helpful comments, including the observation that discontinuous equioutput transformations can exist (this fact was also pointed out to us by Arcady Nemirovskii). Thanks also to Luis Almeida, Charles Fefferman, Lee Giles, David Hestenes, Margarita Kuzmina, Irina Surina, and Robert Williamson for valuable discussions.

References

Bachman, G., and Narici, L. 1966. Functional Analysis. Academic Press, New York.
Becker, S., and Hinton, G. E. 1992. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature (London) 355, 161-163.
Bishop, C. M. 1991. Improving the generalization properties of radial basis function neural networks. Neural Comp. 3, 579-588.
Bishop, C. M. 1990. Curvature-driven smoothing in backpropagation neural networks. Proc. of the International Neural Network Conf., Paris, 2, 749-752. Kluwer, Dordrecht.
Broomhead, D. S., and Lowe, D. 1988. Multivariable function interpolation and adaptive networks. Complex Syst. 2, 321-355.
Carpenter, G. A., Grossberg, S., and Reynolds, J. H. 1991. ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks 4, 565-588.
Chen, A. M., and Hecht-Nielsen, R. 1991. On the geometry of feedforward neural network weight spaces. Proc. Second IEE International Conference on Neural Networks, 1-4. IEE Press, London.
Hartmann, E. J., Keeler, J. D., and Kowalski, J. M. 1990.
Layered neural networks with gaussian hidden units as universal approximations. Neural Comp. 2, 210-215.
Hecht-Nielsen, R. 1992. Theory of the backpropagation neural network. In Neural Networks for Human and Machine Perception, Volume 2, H. Wechsler, ed., pp. 65-93. Academic Press, Boston, MA.
Hecht-Nielsen, R. 1991. Neurocomputing. Addison-Wesley, Reading, MA.
Hecht-Nielsen, R. 1990. On the algebraic structure of feedforward network weight spaces. In Advanced Neural Computers, R. Eckmiller, ed. Elsevier/North-Holland, Amsterdam.
Hecht-Nielsen, R., and Evans, K. M. 1991. A method for error surface analysis. In Theoretical Aspects of Neurocomputing, M. Novák and E. Pelikán, eds., pp. 13-18. World Scientific, Singapore.
Helgason, S. 1962. Differential Geometry and Symmetric Spaces. Academic Press, New York.
Humphreys, J. E. 1972. Introduction to Lie Algebras and Representation Theory. Springer-Verlag, New York.
Linsker, R. 1988. Self-organization in a perceptual network. IEEE Computer Mag. 21, 105-117.
Moody, J., and Darken, C. J. 1989. Fast learning in networks of locally-tuned processing units. Neural Comp. 1, 281-294.
Poggio, T., and Girosi, F. 1990. Regularization algorithms for learning that are equivalent to multilayer networks. Science 247, 978-982.
Reilly, D. L., Cooper, L. N., and Elbaum, C. 1982. A neural model for category learning. Biol. Cyber. 45, 35-41.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning internal representations by error propagation. In Parallel Distributed Processing, D. E. Rumelhart and J. L. McClelland, eds. MIT Press, Cambridge, MA.
Sussmann, H. J. 1992. Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks 5, 589-593.
Varadarajan, V. S. 1984. Lie Groups, Lie Algebras, and Their Representations. Springer-Verlag, New York.
Weyl, H. 1946. The Classical Groups. Princeton University Press, Princeton.
Received 27 March 1992; accepted 15 March 1993.
Communicated by Stephen J. Nowlan
Rational Function Neural Network

Henry Leung
Surface Radar Section, Radar Division, Defence Research Establishment Ottawa, Ottawa, Ontario, Canada K1A 0K2

Simon Haykin
Communications Research Laboratory, McMaster University, Hamilton, Ontario, Canada L8S 4K1
In this paper we observe that a particular class of rational function (RF) approximations may be viewed as feedforward networks. Like the radial basis function (RBF) network, the training of the RF network may be performed using a linear adaptive filtering algorithm. We illustrate the application of the RF network by considering two nonlinear signal processing problems. The first problem concerns the one-step prediction of a time series consisting of a pair of complex sinusoids in the presence of colored non-gaussian noise. Simulated data were used for this problem. In the second problem, we use the RF network to build a nonlinear dynamic model of sea clutter (radar backscattering from a sea surface); here, real-life data were used for the study.
Neural networks are nonlinear parametric models that can approximate any continuous input-output relation. The problem of finding a suitable set of parameters that approximates an unknown relation is usually solved using a learning algorithm. The problem of learning a mapping between an input and an output space is equivalent to the problem of synthesizing an associative memory that retrieves the appropriate output pattern when presented with the associated input pattern, and generalizes when presented with new inputs. A classical framework for this problem is approximation theory. Neural networks functioning as approximators of general maps are currently under intense investigation (Poggio and Girosi 1990; Moody and Darken 1988). A very important class of applications is nonlinear signal processing, particularly the prediction of a chaotic time series. In this application, a neural network predictor may tell the difference between purely random and deterministic processes, and in the latter case allow longer-time predictions. The practical engineering problem that we attempt to solve using neural networks in this paper is one such

Neural Computation 5, 928-938 (1993) © 1993 Massachusetts Institute of Technology
example. In Leung and Haykin (1990), we demonstrated that sea clutter (electromagnetic backscattering from a sea surface) contains a strange attractor. In other words, sea clutter may permit a chaotic description. To detect a small target in an ocean environment, a common technique is to build a model for sea clutter, and then suppress the clutter (interference) by means of that model. Since a neural network is powerful in predicting a chaotic sequence (Lapedes and Farber 1987) (in other words, it can provide a good model for a chaotic sequence), we would like to use a neural network to build a model for sea clutter to perform adaptive radar detection. Approximation theory deals with the problem of approximating a function f(x) of an input vector x by another function F(w, x) having a fixed number of parameters denoted by the vector w. The parameters of F are chosen so as to achieve the best possible approximation of the function f. As pointed out in Poggio and Girosi (1990), there are three main problems in designing a neural network from the point of view of approximation theory:

1. Which particular approximation F to use.
2. For a given choice of F, which algorithm to use for finding the optimal values of the parameters that characterize F. 3. For a selected algorithm, which efficient implementation to use.
Problem 1 relates to the computational power of a neural network. A basic requirement is that the network should be a universal approximator; that is, it should approximate any continuous function. The conventional multilayer perceptron (MLP) is one such example (Funahashi 1989), although it is not usually designed for this purpose. The radial basis function (RBF) neural network (Poggio and Girosi 1990; Moody and Darken 1988; Broomhead and Lowe 1988) is another example. The RBF network is particularly powerful from the point of view of approximation theory. Not only is it a universal approximator, but it also has many nice function approximation properties, such as the existence of a best approximation (Girosi and Poggio 1990) and the ability to regularize an ill-posed problem (Poggio and Girosi 1990); the latter feature cannot be found in a conventional MLP. These properties make it possible for the RBF network to exhibit robust performance even on noisy data. Problem 2 relates to the efficiency of a neural network. It is pointed out in Moody and Darken (1988) that a major handicap of an MLP is the inefficiency of the backpropagation (BP) algorithm commonly used to train it. Even when the MLP is implemented using a faster optimization procedure such as the conjugate gradient algorithm, it is still painfully slow. The convergence is so slow that training usually requires repeatedly feeding the training data into the network. It is suggested in Moody and Darken (1988) that while this approach makes sense for
“off-line” problems, it is probably too inefficient for solving many real-time problems found in such areas as adaptive signal processing and biological information processing. In this paper we propose a network architecture that uses a rational function (RF) (Braess 1986) to construct a mapping neural network. Our motivation for using a rational function is 5-fold:

1. The class of rational functions has been proven to universally approximate real-valued functions having certain smoothness properties (Braess 1986). It is a global approximator, like a polynomial function (Barron and Barron 1988). A rational function usually requires fewer parameters than a polynomial (lower complexity) and it is a better extrapolator than a polynomial (better generalization ability) (Lee and Lee 1988). However, the complexity problem is still considered a major drawback of the rational function approach. One suggestion for handling this problem is the self-organization method described in Farlow (1984).

2. Rational functions constitute a well-developed class of approximation techniques. Many useful properties, such as the existence of a best approximation, are already known; knowledge of these properties can help us better understand the ability and the performance of a neural network based on a rational function.

3. The parameters of a rational function can be computed by a linear adaptive filtering algorithm such as the recursive least squares (RLS) algorithm. Consequently, an RF neural network can reach a global optimum without using a nonconvex optimization technique.

4. An RF network can be implemented efficiently on a systolic array by using a linear adaptive filtering algorithm such as the recursive least squares-QR decomposition (RLS-QRD) algorithm (Haykin 1991) or its fast versions (Haykin 1991; Cioffi 1990); it is therefore well-suited for solving real-time problems.

5. Rational functions can model many real-life problems [e.g., optical transformations (Pitas and Venetsanopoulos 1990), interpolation of TV image sequences (Pitas and Venetsanopoulos 1990), input resistance of cascaded resistance networks (Daiuto et al. 1989), and image propagation for two inward-facing parabolic mirrors (Daiuto et al. 1989)].

Points 1 and 2 relate to Problem 1 of designing a neural network from the viewpoint of approximation theory mentioned previously. Points 3 and 4 provide possible solutions to Problems 2 and 3, respectively. Point 5 emphasizes the wide applicability of rational functions for solving real-life problems.
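Point 1's claim that rational functions extrapolate better than polynomials is easy to illustrate. The sketch below is entirely our own construction (not an experiment from the paper): it fits f(x) = 1/(1 + x²) on [−1, 1] with a degree-4 polynomial and with a linearized order-(0,2) rational function, then compares both at x = 3, well outside the fitting interval:

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + x**2)

x = np.linspace(-1.0, 1.0, 41)
y = f(x)

# Degree-4 polynomial least-squares fit.
poly = np.polyfit(x, y, 4)

# Rational fit y = a0 / (1 + b1*x + b2*x^2): cross-multiplying gives the
# linear system a0 - y*b1*x - y*b2*x^2 = y in the unknowns (a0, b1, b2).
A = np.column_stack([np.ones_like(x), -y * x, -y * x**2])
a0, b1, b2 = np.linalg.lstsq(A, y, rcond=None)[0]

def rational(t):
    return a0 / (1.0 + b1 * t + b2 * t**2)

t = 3.0  # extrapolation point
poly_err = abs(np.polyval(poly, t) - f(t))
rat_err = abs(rational(t) - f(t))
print(poly_err, rat_err)
```

Because the target is itself an order-(0,2) rational function, the linearized fit recovers it essentially exactly, while the polynomial extrapolates poorly as its leading term grows.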
Rational Function Neural Network

A rational function from R^m to R is the quotient of two polynomials, as shown by

y = P(x_1, x_2, ..., x_m) / Q(x_1, x_2, ..., x_m)    (1)

where x_1, x_2, ..., x_m are the scalar inputs applied to the network and P and Q are multivariate polynomials in those inputs. The set (x_1, x_2, ..., x_m) forms a vector x in R^m, and y is the value of the mapping of that vector in the range R. The representation of equation 1 is unique, up to constant factors in the numerator and the denominator polynomials. The rational function must clearly have a finite order for it to be useful in solving a real-life problem. Let the order of the numerator polynomial be α and that of the denominator polynomial be β. Then, we say that the rational function has order (α, β), and so denote it by R_{αβ}. Assume that we have an (α, β) rational function and the desired response is d. To get the best approximation, we seek a rational function belonging to R_{αβ} that solves the following minimization problem:

min Σ_{i=1}^{t} |d(i) − R_{αβ}(i)|²    (2)

where t is the total number of examples available for learning. To obtain the optimum estimate of the coefficients of a rational function neural network, we may first transform equation 2 to a least-squares problem of solving a set of nonlinear equations, that is,

d(i) = P(x(i)) / Q(x(i)),    i = 1, 2, ..., t    (3)

Note that the number of equations should be larger than the number of variables (i.e., we have an overdetermined system of equations), which is a general setting for least-squares estimation. Solving these nonlinear equations directly is a difficult job, which usually requires nonconvex optimization. To avoid this, we use cross-multiplication to move the denominator on the right side of equation 3 to the left side. Also, without loss of generality, the constant denominator coefficient b_0 can be assumed to be unity. After rearrangement of terms, we obtain a new set of linear equations (linear in the parameters).
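As a toy illustration of this linearization (entirely our own example, not from the paper), the sketch below fits a one-input order-(1,1) rational function y = (a_0 + a_1 x)/(1 + b_1 x). Cross-multiplying gives d = a_0 + a_1 x − b_1 (d x), which is linear in (a_0, a_1, b_1) and can be solved with a recursive least squares (RLS) recursion of the kind summarized in Table 1:

```python
import random

def rls(us, ds, lam=1.0, delta=1e-3):
    """Recursive least squares over regressor/response pairs (u(n), d(n))."""
    dim = len(us[0])
    # P(0) = delta^{-1} I, w(0) = 0
    P = [[(1.0 / delta) if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    w = [0.0] * dim
    for u, d in zip(us, ds):
        Pu = [sum(P[i][j] * u[j] for j in range(dim)) / lam for i in range(dim)]
        denom = 1.0 + sum(u[i] * Pu[i] for i in range(dim))
        k = [Pu[i] / denom for i in range(dim)]           # gain vector k(n)
        r = d - sum(w[i] * u[i] for i in range(dim))      # a priori error r(n)
        w = [w[i] + k[i] * r for i in range(dim)]         # weight update
        uP = [sum(u[i] * P[i][j] for i in range(dim)) / lam for j in range(dim)]
        P = [[P[i][j] / lam - k[i] * uP[j] for j in range(dim)] for i in range(dim)]
    return w

# Toy target: d = (a0 + a1*x) / (1 + b1*x) with (a0, a1, b1) = (2.0, -1.0, 0.5).
random.seed(0)
xs = [random.uniform(-0.9, 0.9) for _ in range(200)]
ds = [(2.0 - 1.0 * x) / (1.0 + 0.5 * x) for x in xs]
# Cross-multiplication: d = a0 + a1*x - b1*(d*x)  ->  u = [1, x, -d*x].
us = [[1.0, x, -d * x] for x, d in zip(xs, ds)]
a0, a1, b1 = rls(us, ds)
print(a0, a1, b1)
```

The regressor entries tied to denominator coefficients carry the factor −d(n)x(n), which is why the desired response itself appears among the network's inputs during training.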
Expressing these equations in matrix notation, we have

U w = d    (4)

where row i of U is the vector u(i) of polynomial terms formed from the inputs and the desired response, w is the vector of numerator and denominator coefficients, and d is the vector of desired responses.
We now have a linear learning problem. Moreover, the minimum norm solution of the least-squares estimation problem can be obtained by any linear adaptive filtering algorithm such as the least mean squares (LMS) algorithm or the recursive least squares (RLS) algorithm. In Table 1, we present a summary of the adaptive algorithm for a rational function using the RLS algorithm. Note that the nonlinearity of the RF neural network as described herein manifests itself by virtue of the nonlinear way in which the vector u(n) is defined in terms of the input data x_1(n), ..., x_m(n) and the desired response d(n). The weight vector w(n) of the network is defined in terms of the numerator and denominator coefficients of the rational function. Next we show how a neural network based on rational functions can be used to represent a multidimensional mapping from R^m to R^n. The network has m input units and n output units. There are hidden layers that form all the polynomial combinations needed to construct the rational function of interest. Each output unit combines all the outputs of the hidden units to form a rational function representation as shown in equation 1. The input layer of a rational function neural network consists of a set of m nodes, into which we feed the components of the m-dimensional vector (x_1, x_2, ..., x_m). The first hidden layer is designed to form all the second-order components that are common to the numerator and denominator polynomials of the rational function. The desired response is also fed into this hidden layer to form second-order components. The second hidden layer is then assigned to the formation of third-order components, and so on for all the other hidden layers. Consider, for example, a rational function with a highest order of seven in either the numerator or denominator polynomial. We will then have six hidden layers to get all the polynomial combinations.
Basically, the hidden layer units multiply their inputs and all incoming hidden weights are equal to 1, and
Table 1: RLS-Based Rational Function Neural Network Algorithm.

Initialize the algorithm by setting

P(0) = δ^{−1} I,    w(0) = 0,    δ = small positive constant

For each instant of time, n = 1, 2, ..., compute

k(n) = λ^{−1} P(n−1) u(n) / [1 + λ^{−1} uᵀ(n) P(n−1) u(n)]
r(n) = d(n) − wᵀ(n−1) u(n)
w(n) = w(n−1) + k(n) r(n)
P(n) = λ^{−1} P(n−1) − λ^{−1} k(n) uᵀ(n) P(n−1)

where λ (0 < λ ≤ 1) is the forgetting factor, u(n) is the vector of polynomial terms formed from the input data and the desired response (the entries multiplying denominator coefficients carry a factor of −d(n) from the cross-multiplication), x_1(n), ..., x_m(n) are the input data, d(n) is the desired response, and w(n) is the adjustable parameter vector to be computed.
nonadaptive, while the output layer units calculate a weighted sum of their inputs, with the weights adaptive and corresponding to the rational function coefficients. For the purpose of illustration, an RF neural network of general character is depicted in Figure 1a. We have also included Figure 1b to illustrate that there are direct connections from all hidden units of the network to the output layer. Note that an RF neural network does not feed back the training error but rather the desired response. Another noteworthy point is that those terms that contain a d(i), the desired response, will not have any d(i) after the learning period. In actual fact, such terms constitute the denominator polynomial of the rational function of interest. We next illustrate the application of RF neural networks by considering two different nonlinear signal processing problems. We first apply the RF network to a nonlinear prediction problem, for which the signal used is described by the following equation:

x(t) = e^{jπ(0.5)t} + 2e^{jπ(−0.5)t} + n(t)    (5)

Figure 1: (a) A rational function neural network. (b) A portion of the RF network illustrating connections within the network.

The additive colored noise n(t) is generated by passing a white uniformly distributed process through a finite-duration-impulse-response (FIR) filter with impulse response

h(t) = Σ_{i=0}^{15} a_i δ(t − i)    (6)
where the filter coefficients are chosen to be the same as those used in Papadopoulos and Nikias (1990). They are {0.5, 0.6, 0.7, 0.8, 0.7, 0.6, 0.5, 0.0, 0.0, 0.5, 0.6, 0.7, 0.8, 0.7, 0.6, 0.5}. The signal-to-noise ratio is set to 0 dB. We use an R_{11} predictor to perform the one-step-ahead prediction, and compare the result with that obtained using a linear predictor. The input vector for both predictors has a dimension of 4, and 50 samples are used for training. The training is carried out using the RLS algorithm. After training, we present new data (not in the training sets), which are
generated by the same model, to test both predictors. The mean and standard deviation of the prediction error for the RF predictor are 0.37 and 0.24, respectively. For the linear predictor, the mean and the standard deviation of the prediction error are 0.52 and 0.37, respectively. The mean prediction error for the RF predictor is about 3 dB less than that for the linear predictor. For the second experiment, we use the RF network to study the sea clutter modeling problem using real-life radar data. In particular, the sea clutter data used were obtained using the IPIX radar (Krasnor et al. 1989) located at a site on Cape Bonavista, Newfoundland, Canada. The radar was used to probe the ocean surface along a radial line. The radar pulse repetition rate was 2 kHz. The sea state was about 1.57 m. The modeling begins by using the neural network as a predictor. The way to do it is very simple. The number of input neurons depends on the embedding dimension of the sea clutter process, which has been shown experimentally to be an integer greater than 6.5 (Leung and Haykin 1990), and the output layer consists of a single neuron that provides the predicted value. After the learning phase is completed, the network is frozen; that is, the connection weights are not allowed to change any more, because the dynamic process is assumed to be time-invariant. The rational function neural network used here had a (2,1) structure, which may be justified as follows. Obviously, order (1,0) cannot be used, since it is just a linear model. Also, order (1,1) is not suitable, in view of a recent discovery (Daiuto et al. 1989) that this structure cannot produce chaotic behavior, since it is not sensitive to initial conditions. Thus, the simplest rational function that can generate chaotic behavior is the (2,1) structure.
Of course, a higher-order structure also has the potential to produce chaotic behavior, but the complexity would be greatly increased, especially when the dimension of the input data is high. Introducing too many parameters is not recommended by the informational Occam's razor. For this particular data set, we chose an embedding dimension of 7. The second layer then contains 49 elements, which form the second-order components of the polynomials. Thus, there are a total of 65 parameters to estimate in this structure. The resulting training error is shown in Figure 2. We observe that the training speed, in terms of the number of training samples, of the RF network is comparable to that of an RBF network. However, the computational time required for each iteration of the RF network is much less than that for the RBF network. To confirm the validity of the model, we cannot simply look at the training error. A small training error tells us only that the network fits the training data, a task that in principle can be accomplished by any model with sufficient parameters. More specifically, after the learning is completed, we have to study the ability of the network to generalize. To do so, we present the network with data not seen before and observe the performance of the network. If the prediction error is reasonably small,
Henry Leung and Simon Haykin
936
Figure 2: Learning curve of the RF and RBF networks for sea clutter prediction. The y-axis is the absolute value of the normalized training error (i.e., the magnitudes of the training error are scaled into the range [0,1]). The x-axis represents the number of samples fed into the network for training.
we can then say that the model is an appropriate one. In this paper, we use the one-step-ahead prediction to demonstrate the generalization ability of the RF network. The normalized prediction error used here is a dynamic-range-independent measure (Casdagli 1989), defined as the absolute prediction error divided by the standard deviation of the data sequence. The prediction errors are computed as an average over 50 trials, and each trial consists of 50 points for prediction. The mean and the standard deviation of the normalized prediction error for the RF network are 0.327 and 0.254, respectively. The same procedure was also applied to the RBF neural network (Moody and Darken 1988) for comparison. (We do not choose a conventional MLP for comparison because this network requires repeated training, which is not suitable for our real-time signal processing problem.) The mean and the standard deviation of the
normalized prediction error for the RBF network are 0.329 and 0.223, respectively. We observe that the prediction error performance of the RF network is about the same as that of the RBF network. However, the complexity of the RF network is lower than that of the RBF network. Not only does the RBF network need to compute the complicated Euclidean distance of high-dimensional vectors and use a time-consuming k-means algorithm, but it also needs 200 to 300 hidden units to obtain a similar performance. Based on the mathematical treatment presented and the nonlinear signal processing applications described herein, we suggest that a neural network based on a rational function approximation is a reasonably good mapping network, especially for real-time applications.
References

Barron, A. R., and Barron, R. L. 1988. Statistical learning networks: A unifying view. In Symposium on the Interface: Statistics and Computing Science, E. Wegman, ed., pp. 192-203. American Statistical Association, Washington, DC.
Braess, D. 1986. Nonlinear Approximation Theory. Springer-Verlag, Berlin.
Broomhead, D. S., and Lowe, D. 1988. Multivariable functional interpolation and adaptive networks. Complex Systems 2, 321-355.
Casdagli, M. 1989. Nonlinear prediction of chaotic time series. Physica D 35, 335-356.
Cioffi, J. M. 1990. The fast adaptive ROTOR's RLS algorithm. IEEE Trans. Acoustics, Speech, Signal Process. 38(4), 631-653.
Daiuto, B. J., Hartley, T. T., and Chicatelli, S. P. 1989. The Hyperbolic Map and Applications to the Linear Quadratic Regulator. Lecture Notes in Control and Information Sciences, Vol. 110, M. Thoma and A. Wyner, eds. Springer-Verlag, Berlin.
Farlow, S., ed. Self-organizing Methods in Modeling. Marcel Dekker, New York.
Funahashi, K. 1989. On the approximate realization of continuous mapping by neural network. Neural Networks 2, 183-192.
Girosi, F., and Poggio, T. 1990. Networks and the best approximation property. Biological Cybernetics 63, 169-176.
Haykin, S. 1991. Adaptive Filter Theory, 2nd ed. Prentice Hall, Englewood Cliffs, NJ.
Krasnor, C., Haykin, S., Currie, B., and Nohara, T. 1989. A dual-polarized radar system. Presented at the International Conference on Radar, Paris, France.
Lapedes, A., and Farber, R. 1987. Nonlinear signal processing using neural networks: Prediction and system modelling. Los Alamos National Laboratory, LA-UR-87-2662.
Lee, K., and Lee, Y. C. 1988. System modeling with rational function. Los Alamos National Laboratory Report.
Leung, H., and Haykin, S. 1990. Is there a radar clutter attractor? Appl. Phys. Lett. 56(6), 593-595.
Moody, J., and Darken, C. 1988. Learning with localized receptive fields. In Proceedings of the 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds., pp. 133-143. Morgan Kaufmann, San Mateo, CA.
Papadopoulos, C. K., and Nikias, C. L. 1990. Parameter estimation of exponentially damped sinusoids using higher-order statistics. IEEE Trans. Acoustics, Speech, Signal Process. 38(8), 1424-1436.
Pitas, I., and Venetsanopoulos, A. N. 1990. Nonlinear Digital Filters: Principles and Applications. Kluwer Academic Publishers, Dordrecht.
Poggio, T., and Girosi, F. 1990. Networks for approximation and learning. Proc. IEEE 78(9), 1481-1497.

Received 5 March 1991; accepted 26 February 1993.
Communicated by John Platt and John Lazzaro
On an Unsupervised Learning Rule for Scalar Quantization following the Maximum Entropy Principle

Marc M. Van Hulle* Dominique Martinez†
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139 USA
A novel unsupervised learning rule, called the Boundary Adaptation Rule (BAR), is introduced for scalar quantization. It is shown that the rule maximizes information-theoretic entropy and thus yields equiprobable quantizations of univariate probability density functions. It is shown by simulations that BAR outperforms other unsupervised competitive learning rules in generating equiprobable quantizations. It is also shown that our rule can do better or worse than the Lloyd I algorithm in minimizing average mean square error, depending on the input distribution. Finally, an application to adaptive nonuniform analog to digital (A/D) conversion is considered.
1 Introduction

The main objective of scalar and vector quantization is to build discrete approximations to continuous functions such as input probability density functions (p.d.f.s) (Kohonen 1990). Two criteria have been widely used for designing quantizers: the minimization of the average distortion due to quantization and the maximization of information-theoretic entropy, i.e., ensuring that each of the quantization intervals is used equally frequently in encoding the input signal (Ahalt et al. 1990). In general, these two criteria are not equivalent, and a particular quantizer is only optimal with respect to a given design criterion. In case the input p.d.f. is not known a priori, the quantizer is constructed by a training process. We will restrict ourselves to this case only. In the case of data representation and compression, the most often used techniques are batch algorithms based on the Lloyd I algorithm (see, e.g., Linde et al. 1980; Gersho and Gray 1991) and are aimed at minimizing average distortion. The major drawback of batch algorithms is that the

*Present address: Laboratorium voor Neuro- en Psychofysiologie, K. U. Leuven, Campus Gasthuisberg, Herestraat, B-3000 Leuven, Belgium.
†Present address: Laboratoire d'Automatique et d'Analyse des Systèmes-CNRS, 7 Av. du Col. Roche, 31077 Toulouse, France.

Neural Computation 5, 939-953 (1993) © 1993 Massachusetts Institute of Technology
940
Marc M. Van Hulle and Dominique Martinez
design of the quantizer only begins after the entire training set is available. Consequently, these algorithms are not able to accommodate "on-line" changes in the input p.d.f. A number of researchers have developed unsupervised competitive learning (UCL) algorithms for training artificial neural networks (ANNs) for the purpose of scalar and vector quantization. The quantizer is built "on the fly," after the presentation of each input sample. In standard UCL, the weights of the network are updated so as to minimize average distortion. The quantization intervals are defined by nearest-neighbor classification. However, as enunciated by Grossberg (1976a,b) and Rumelhart and Zipser (1985), among others, one problem with standard UCL is that some neurons may never win the competition and, therefore, never learn (dead units). In practical applications for data representation and compression, it is essential to add mechanisms that avoid dead units and ensure an equitable distribution of weights in the input signal space. This has been done in Kohonen learning (Kohonen 1989) by adding a neighborhood to each neuron (Nasrabadi and Feng 1988; Naylor and Li 1988). In another approach, Grossberg (1976a,b) added a "conscience" that makes frequently winning neurons feel "guilty" and reduce their winning rate. Several researchers have introduced various methods inspired by the "conscience" mechanism (for references, see Hertz et al. 1991), usually with the purpose of achieving an equiprobable quantization, and thus maximizing entropy. In this article, an unsupervised learning rule is introduced for scalar quantization. The design criterion adopted is entropy maximization. The rule is completely different from these approaches since it adapts the boundary points that separate the quantization intervals. Hence it is called the Boundary Adaptation Rule (BAR). It can be shown mathematically that this rule maximizes entropy (Van Hulle and Martinez 1993). Due to this property, BAR can be used in a number of applications such as entropy estimation (Mokkadem 1989) and the formation of nonparametric models of input p.d.f.s (Silverman 1986). Here we show how the rule can be used for building a nonuniform A/D converter that is able to adapt itself to long-term drifts in sensor characteristics and changing environmental conditions.

2 Nonuniform Scalar Quantization
Scalar quantization transforms continuous-valued signals into a discrete number of quantization levels. In uniform quantization, the analog signal range R is partitioned into k equally sized regions, called quantization intervals or partition cells, separated by k - 1 equally spaced boundary points. Such a quantization is optimal only for stationary, uniform input distributions. In the general case, however, the quantization should follow the distribution as closely as possible in order to quantize the distribution
Unsupervised Learning Rule for Scalar Quantization
941
efficiently given limited quantization resources. This way, the dynamic range that can be accommodated is significantly increased and a better nonparametric model of the input p.d.f. is obtained. A standard quantizer comprises an encoder and a decoder. The encoder is uniquely determined by the set of quantization intervals and the decoder by an associated set of output values. To formalize this, let x be a scalar input value and p(x) its p.d.f. Suppose that we have k nonoverlapping quantization intervals D_i for partitioning the analog signal range R:

R = \bigcup_{i=1}^{k} D_i, \quad \text{with } D_i \cap D_j = \emptyset, \ \forall i \neq j \quad (2.1)

Quantization intervals are encoded into digital codes. For this, let Act_{D_i} be a binary variable indicating code membership so that

Act_{D_i}(x) = \begin{cases} 1 & \text{if } x \in D_i \\ 0 & \text{if } x \notin D_i \end{cases} \quad (2.2)

The corresponding code is then represented by the k-dimensional vector (Act_{D_1}, Act_{D_2}, \ldots, Act_{D_k}). Since the x are drawn from a p.d.f. p(x), the probability of x falling in interval D_i satisfies

p(D_i) = \int_{D_i} p(x)\,dx = \int_R Act_{D_i}(x)\,p(x)\,dx = E[Act_{D_i}] \quad (2.3)

with p(R) = \sum_{i=1}^{k} p(D_i) = 1. The efficiency of quantization is proportional to how well the density of the k - 1 boundary points approximates the p.d.f. p(x). Regardless of the type of p.d.f., we want each code to be active with an equal probability E[Act_{D_i}] = 1/k. This way, the information-theoretic entropy or channel capacity of the quantizer

I = -\sum_{i=1}^{k} E[Act_{D_i}] \log_2 E[Act_{D_i}] \quad (2.4)

is maximized and equal to \log_2 k. In general, this implies a nonuniform quantization of the signal range R. Finally, the digital codes are decoded into analog output values. Define the set C of k output levels

C = \{ y_1, y_2, \ldots, y_k \mid y_1 \in D_1, y_2 \in D_2, \ldots, y_k \in D_k \} \quad (2.5)

as the quantizer's codebook.

3 Boundary Adaptation Rule
In case of UCL for scalar (vector) quantization, the system compares the present input x with k weights w_i. If w_i is "closest" to x, then the ith neuron wins the competition and its output represents the code membership function for code i. Different rules exist for modifying the weights (Hertz
et al. 1991), but in practice they amount to finding the w_i's representing the centroids of the quantization intervals; the boundary points are then marked by nearest-neighbor classification. In contrast, our approach does not rely on nearest-neighbor classification. Rather than finding the centroids of the quantization intervals, our BAR directly computes the boundary points. Let \theta_{i-1} and \theta_i be two boundary points that demarcate the interval D_i so that D_i = [\theta_{i-1}, \theta_i) for 1 < i < k; in the case of D_1 and D_k we have that D_1 = (-\infty, \theta_1) and D_k = [\theta_{k-1}, +\infty). The interval [\theta_1, \theta_{k-1}] is called the dynamic range of the quantizer. Assume that for input x, Act_{D_i} = 1. We then modify D_i by increasing \theta_{i-1} and decreasing \theta_i. In its simplest form, the rule (BAR) reduces to

\Delta\theta_i = \eta \, (Act_{D_{i+1}} - Act_{D_i}), \quad 1 \le i < k \quad (3.1)
with \eta the learning rate, a positive scalar. Note that this is an unsupervised learning rule since it does not contain knowledge about the factual code membership of each input x. This rule satisfies the maximum entropy principle since at convergence we have on average that E[\Delta\theta_i] = 0, 1 \le i < k, from which it follows that E[Act_{D_{i+1}}] = E[Act_{D_i}]. Hence, E[Act_{D_i}] = 1/k, \forall i. The proof of convergence is given in Van Hulle and Martinez (1993). This rule has the disadvantage that, for a given k, the rate of convergence of the boundary points is inversely proportional to the accuracy with which they are defined at convergence (both are set by \eta). Hence, to define codes with small quantization intervals in R space, a low \eta value is required, which inevitably leads to a slow rate of convergence. A more elegant rule that overcomes this problem and still maximizes entropy is
and
This rule has the advantage of yielding a higher boundary point accuracy where it is needed, that is, where the quantization interval is small. This way, for the same average accuracy, the rate of convergence is higher than that of the simplest rule. Note that the rule is slightly different for \Delta\theta_{k-1} since there exists no \theta_k. The fastest rule is found by updating all boundary points each time an input is presented:
We will refer to this rule as fast BAR (FBAR). For this rule, the rate of convergence is independent of the number of quantization intervals.
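The simplest rule, equation 3.1, can be sketched in a few lines. This is our own illustration with a hypothetical function name; it assumes the boundary points are kept in increasing order.

```python
import numpy as np

def bar_step(theta, x, eta):
    """One update of the simplest Boundary Adaptation Rule (equation 3.1).
    theta holds the k-1 boundary points in increasing order; the interval
    containing x shrinks: its lower boundary rises, its upper boundary falls."""
    i = int(np.searchsorted(theta, x, side='right'))  # x falls in the (i+1)th interval
    if i > 0:
        theta[i - 1] += eta   # raise the lower boundary of the winning interval
    if i < len(theta):
        theta[i] -= eta       # lower the upper boundary of the winning interval
    return theta
```

Fed with samples from a uniform density on [0,1) and k = 4 intervals, the three boundary points drift toward the quartiles 0.25, 0.5, 0.75, that is, toward an equiprobable quantization.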
4 Comparison with Unsupervised Competitive Learning
We will compare the overall performance of BAR with a number of unsupervised competitive learning rules. The performance of the quantizer will be assessed in three ways: in terms of speed of convergence, quantization entropy I, and codebook utilization \{p(Act_{D_i}) \mid 1 \le i \le k\}. The quantization entropy is calculated following equation 2.4, with p(D_i) estimated as

\hat{p}(D_i) = \frac{1}{T} \sum_{t=1}^{T} Act_{D_i}[t] \quad (4.1)

with t the time or iteration step, for T large and after the quantization process has converged. Consider a gaussian p.d.f. with mean \bar{x} = 0.5 and standard deviation \sigma_x = 0.15. For k = 32, the temporal evolutions of the boundary points are given in Figure 1A and B for BAR equation 3.2 with \eta = 0.005, and FBAR with \eta = 0.00025, respectively. From Figure 1A, two observations can be made: (1) the gaussian p.d.f. is quantized nonuniformly, with a higher boundary point accuracy where needed, that is, around \bar{x}; (2) convergence is reached in about 10^6 time steps. Extensive simulations have shown that, for k quantization intervals and this p.d.f., convergence is reached in about k x 30,000 time steps. The nonuniform transfer characteristic obtained at convergence is given in Figure 1C. The speed of FBAR is revealed in Figure 1B: only 40,000 time steps are needed for this case. Furthermore, this does not depend on the number of intervals used in the quantizer. The entropy performance is plotted in Figure 2A. We observe that the \log_2 k function is closely followed: the error is less than 0.1% everywhere. We have also considered five stochastic unsupervised competitive learning algorithms used as scalar quantizers: standard UCL, Kohonen learning, original Conscience Learning (Conscience 1; DeSieno 1988) and its slightly modified version (Conscience 2; Van den Bout and Miller 1989), and Frequency-Sensitive Competitive Learning (FSCL; Ahalt et al. 1990). The entropy error for these algorithms is at least an order of magnitude larger than ours.
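The entropy measurement above can be reproduced in a few lines: estimate each p(D_i) by the empirical activation frequency of equation 4.1 and insert it into the entropy sum of equation 2.4. A sketch, with our own helper name:

```python
import numpy as np

def quantization_entropy(theta, samples):
    """Empirical entropy of a scalar quantizer (equations 2.4 and 4.1):
    estimate p(D_i) as the fraction of samples landing in interval D_i,
    then return -sum p_i log2 p_i. Equiprobable codes give log2 k."""
    k = len(theta) + 1
    counts = np.bincount(np.searchsorted(theta, samples), minlength=k)
    p = counts / len(samples)
    p = p[p > 0]                       # convention: 0 log 0 = 0
    return float(-(p * np.log2(p)).sum())
```

With boundary points at the quartiles of the data, all four codes fire equally often and the entropy equals log2 4 = 2 bits.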
We will now apply these rules for quantizing the previous gaussian into k = 32 intervals. Figure 2B shows the corresponding p(Act_{D_i}) distributions (codeword utilization schemes), together with that obtained using BAR equation 3.1. These distributions show the accuracy with which each rule approximates the gaussian p.d.f. We observe that BAR yields an excellent approximation of the gaussian p.d.f. (full line); the other versions of BAR yield similar results. Conscience 2 is the next best rule. We also observe that both Kohonen learning and standard UCL underestimate the low density regions and overestimate the high density regions of the gaussian p.d.f., as expected since, for the
Figure 1: Nonuniform quantization of a gaussian p.d.f. with \bar{x} = 0.5 and \sigma_x = 0.15, using k = 32 intervals. (A,B) Temporal evolution of the boundary points using BAR equation 3.2 with \eta = 0.005 (A) and FBAR with \eta = 0.00025 (B). Starting values are chosen randomly in the interval [0,1). Note the difference in time scale. (C) Transfer characteristic corresponding to A. The output level associated with interval D_i is chosen as y_i = (\theta_i + \theta_{i-1})/2, \forall i, with \theta_0 = 0 and \theta_k = 1.
Figure 2: Performance of the adaptive, nonuniform quantizer. (A) Entropy performance of our quantizer as a function of k using BAR equation 3.2 with \eta = 0.005. The other versions of BAR yield the same performance. (B) P.d.f. estimation performance in terms of p(Act_{D_i}) (codebook utilization). Comparison of BAR equation 3.1 (BAR equation 3.2 and FBAR yield similar results) (thick full line) with standard UCL (dots), Kohonen learning (dashes), Conscience learning 1 (thin dot-dashes) and 2 (thick dot-dashes), and FSCL (thin full line). Kohonen learning was started with a neighborhood of 10 neurons; after each 10^6 iterations, the neighborhood was reduced by 2 neurons until no neighborhood remained. For Conscience learning 1 and 2, the conscience factor was 10 and 2, as suggested by DeSieno (1988) and Van den Bout and Miller (1989), respectively. For standard UCL, Kohonen learning, FSCL, and Conscience 1, the learning rate was decreased linearly from 0.1 to zero over 5 x 10^6 iterations so as to obtain stable quantizations. For Conscience 2 and BAR, it was decreased from 0.01 and 0.002, respectively. Starting values of \eta were chosen for the purpose of obtaining centroid and boundary point traces with comparable noise characteristics.

one-dimensional case, the weight distributions are known to be proportional to p(x)^{2/3} (Ritter and Schulten 1986). FSCL behaves in a similar way since the weight distribution obtained seems to be proportional to p(x)^{1/2}. Conscience 1 results in a p(Act_{D_i}) distribution centered around 1/k = 1/32. However, the "conscience" heuristic is not very successful in forcing the Act_{D_i}'s to be equally active. The reason the distribution looks erratic is that, in the simulations, all of the "conscience" was taken by a small number of weights only, leaving the other weights virtually unchanged. Finally, from the temporal behavior of the different rules, we found that FBAR is unparalleled in speed: it was at least 25 times
faster than any of the other rules. After FBAR, BAR and Conscience 2 were the fastest rules, and Kohonen learning was the slowest.

5 Comparison with Lloyd I Algorithm
Despite the fact that the purpose of BAR is to maximize entropy and not to minimize average distortion, it is instructive to compare the performance of our quantizer with that of the Lloyd I algorithm. Note, however, that the two design criteria are not equivalent; this is shown in the appendix for the scalar case. A quantizer that minimizes average distortion must satisfy two necessary conditions, given the number of quantization levels and the input p.d.f.: the nearest-neighbor condition and the centroid condition (Lloyd-Max conditions). For reasons of simplicity and mathematical convenience, distortion is often measured in terms of the mean squared error (MSE):

MSE = \sum_{i=1}^{k} \int_{D_i} (x - y_i)^2 \, p(x) \, dx \quad (5.1)

with y_i the centroid (center of mass) of the input p.d.f. that lies in the corresponding interval D_i. In practice, MSE is approximated as a statistical average over T discrete time intervals:

MSE \approx \frac{1}{T} \sum_{t=1}^{T} \left( x[t] - \sum_{i=1}^{k} Act_{D_i}[t] \, y_i[t] \right)^2 \quad (5.2)
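The Lloyd I iteration referred to above alternates the two Lloyd-Max conditions on a training set. A minimal batch sketch for the scalar case (our own implementation; the quantile-based initialization is an assumption, not taken from the paper):

```python
import numpy as np

def lloyd1(samples, k, iters=100, tol=1e-8):
    """Batch Lloyd I for scalar quantization: alternate the nearest-neighbor
    partition and the centroid (mean) condition until the codebook stops moving."""
    y = np.quantile(samples, (np.arange(k) + 0.5) / k)  # initial output levels
    for _ in range(iters):
        theta = (y[:-1] + y[1:]) / 2                    # nearest-neighbor boundaries
        idx = np.searchsorted(theta, samples)           # assign samples to intervals
        y_new = np.array([samples[idx == i].mean() if np.any(idx == i) else y[i]
                          for i in range(k)])           # centroid condition
        moved = np.max(np.abs(y_new - y))
        y = y_new
        if moved < tol:
            break
    return y
```

On two well-separated clusters of samples with k = 2, the output levels settle on the cluster means.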
In case the p.d.f. is not known, one has to proceed with a training set comprising empirical data. Under this condition, we will compare the performance of our quantizer with the Lloyd I algorithm. In an effort to make a fair comparison, we consider different training sets comprising the same number of samples as needed by FBAR to converge and to determine the average codebook utilization, that is, 50,000 samples. Lloyd I then repeatedly iterates on each of these sets until convergence. The MSE plot as a function of the number of bits is given in Figure 3A for a gaussian with mean \bar{x} = 0.0 and standard deviation \sigma_x = 1.0. We have also plotted the optimal MSE results determined numerically by Max (1960) using a priori knowledge of the gaussian. From these results we observe that Lloyd I (thin full line) yields better results than FBAR (thick full line), as expected. However, the Lloyd I results are not as good as the optimal ones (dotted line). Furthermore, Lloyd I needs more computing time than FBAR, since about 50 iterations on average were needed for the k = 32 case and only 1 for FBAR. The codebook utilization schemes are given in Figure 3B for k = 32. We observe that FBAR outperforms Lloyd I in approximating the gaussian. In addition, there is considerable variation in the Lloyd I results: the standard deviation ranges from
Figure 3: (A) Average mean square error (MSE) as a function of the number of intervals k. Comparison between FBAR (thick full line), Lloyd I (thin full line), and the optimal [min(MSE)] quantizer (dotted line). The input distribution is a gaussian with \bar{x} = 0.0 and \sigma_x = 1.0. For FBAR and Lloyd I, the MSE results are averages over 20 training sets; vertical bars denote confidence intervals (standard deviations). (B) Codebook utilization corresponding to A. Comparison between FBAR (thick full line) and Lloyd I (thin full line). (C) Average mean square error as a function of k. Comparison in case of the same gaussian but with all values in the interval (-1,1) removed.
3.1 x (interval 32) to 1.8 x 10^{-1} (interval 19), whereas for FBAR it is less than 4.2 x 10^{-4} everywhere. Recently, a proof of convergence for Lloyd I was found by Wu (1992). The proof holds only in case the input p.d.f. is continuous, positive, and defined on a finite interval. Furthermore, the Lloyd-Max conditions are not sufficient to guarantee the overall optimality of the quantizer (see, e.g., Gersho and Gray 1991). We have shown that BAR converges for any p.d.f. and thus will always maximize entropy (Van Hulle and Martinez 1993). Hence, Lloyd I cannot do better than BAR in maximizing entropy. However, the reverse does not hold: BAR can do better than Lloyd I in minimizing MSE. To show this, consider the same gaussian but with all values in the interval (-1,1) removed. The resulting MSE plots are given in Figure 3C. Lloyd I fails here because of the gap in the p.d.f.; such gaps can also occur with small training sets. Furthermore, the algorithm often fails to converge due to oscillations in the MSE values it produces: 1 out of 8 runs did not converge for k = 32. For the runs that did converge, more than 100 iterations were needed on average.

6 Adaptive Nonuniform A/D Conversion
Recently, much attention has been given to the development of so-called smart sensors (Esteve et al. 1992). In these sensors, analog to digital (A/D) conversion is combined with circuits performing dynamic correction of changes in sensor characteristics and in environmental conditions. A key advantage of the ANN technique derives from its learning capabilities. Hence, ANNs could be useful for building adaptive A/D converters. However, most of the existing neural-based A/D converters are nonadaptive and assume a uniform distribution of sensor signals (Lee and Sheu 1989, 1992; Tank and Hopfield 1986). In addition, the Hopfield network, on which they are based, is intrinsically unreliable because of multiple stable equilibria and therefore needs additional circuitry to overcome this limitation. Others have adopted a direct mapping approach, but the networks proposed are still restricted to uniform input distributions and are also nonadaptive (Michel and Gray 1990; Ogunfunmi and Wadhwa 1992). However, if BAR is used for adapting the quantization process, one can achieve adaptive nonuniform A/D conversion, as was basically shown in Section 4. Before we outline a possible circuit, we will first consider the effect of long-term drifts in sensor characteristics and of transient changes in environmental conditions. To illustrate the effect of long-term drifts, consider the case in which the mean sensor output \bar{x} gradually shifts in time. We assume again the gaussian p.d.f.; the drift is 2.5 x 10^{-8} x-units per time step t. The result is given in Figure 4A for k = 11 and BAR equation 3.2 with \eta = 0.005. We observe the rule's ability to adjust itself to gradually changing sensor
Figure 4: (A) Effect of a long-term drift in the input characteristic. The input distribution is a gaussian p.d.f. with \sigma_x = 0.1 and \bar{x}[t] = 0.75 - 2.5 x 10^{-8} t; k = 11 and BAR equation 3.2 is used with \eta = 0.005. (B) Effect of a transient change in the input distribution sensed, for FBAR with \eta = 0.00025. At t = 2.5 x 10^6 the input p.d.f. changes from a gaussian with \sigma_x = 0.15 and \bar{x} = 0.5 into a uniform distribution on the interval [0, 0.5); k = 32. For both A and B, the starting values of the boundary points are chosen randomly in the interval [0,1).
characteristics. Extensive simulations revealed the maximal drift slope (in x-units per time step) the rule can accommodate without the quantization lagging behind, for these k and \eta values; the maximum slope decreases with increasing k. For FBAR with \eta = 0.00025, the maximal slope is larger and independent of the number of intervals k. Finally, the effect of a change in environmental conditions can be simulated by a change in the type of p.d.f. sensed. We consider here a transient change from a gaussian to a uniform p.d.f. The change occurs at t = 2.5 x 10^6. The result is depicted in Figure 4B for k = 32 using FBAR. We observe the fast transition from a nonuniform to a uniform quantization of the signal range R: under the given input conditions and \eta value, the transition takes about 40,000 time steps. Finally, we outline the design of a flash A/D converter, so called since it yields an output in a single time step; the development of the neural hardware is the subject of ongoing research. An A/D converter is uniquely determined by its set of quantization intervals and corresponding binary code words. The quantization process can be implemented using k - 1 binary threshold elements ("neurons") acting as comparators
and operating on the scalar input x. The output of the ith threshold element, V_i, 1 \le i < k, satisfies

V_i(x) = \begin{cases} 1 & \text{if } x \ge \theta_i \\ 0 & \text{otherwise} \end{cases} \quad (6.1)

As a result, the k quantization intervals are converted into k binary code vectors (V_1, V_2, \ldots, V_{k-1}) following the principle of thermometer coding. The thermometer code is then transformed into an M-dimensional binary code (C_1, C_2, \ldots, C_M), with 2^{M-1} < k \le 2^M, using dedicated combinatorial logic. Finally, to perform boundary point adaptation, the Act_{D_i} values are needed. They are obtained as follows:

Act_{D_i}(x) = \overline{V_i} \wedge V_{i-1}, \quad 1 < i < k \quad (6.2)
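Equations 6.1 and 6.2 can be checked with a small software model of the comparator bank; this is our own sketch, in which the conventions V_0 = 1 and V_k = 0 handle the two end intervals.

```python
import numpy as np

def flash_encode(x, theta):
    """Thermometer-coded flash conversion (equations 6.1 and 6.2):
    V_i = 1 iff x >= theta_i; the winning interval D_i satisfies
    Act_{D_i} = (not V_i) and V_{i-1}, with V_0 = 1 and V_k = 0."""
    V = (x >= np.asarray(theta)).astype(int)    # comparator outputs, equation 6.1
    Vpad = np.concatenate(([1], V, [0]))        # V_0 = 1 and V_k = 0 for the end intervals
    act = (1 - Vpad[1:]) & Vpad[:-1]            # Act_{D_i}, equation 6.2
    return V, int(np.argmax(act))               # thermometer code and interval index
```

For theta = (0.25, 0.5, 0.75), an input of 0.6 yields the thermometer code (1, 1, 0) and selects the third interval [0.5, 0.75).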
7 Conclusion
The purpose of this article was not to introduce a rule that minimizes average distortion but one that maximizes information-theoretic entropy. Our rule is completely different from other unsupervised learning rules: rather than finding the centroids of the quantization intervals, BAR directly finds the boundary points that demarcate the quantization intervals. Since it is guaranteed to maximize entropy, BAR always arrives at an equiprobable quantization of the analog signal range and, thus, at a reliable, nonparametric model of the input probability density function. We have shown that BAR is a useful rule through its simplicity, its minor computational requirements, its speed, and its ability to adapt to long-term drifts and transient changes in input characteristics. We have also shown that, in terms of minimizing the mean squared error (MSE) for a given number of quantization intervals, BAR can do better or worse than Lloyd I, an algorithm that explicitly attempts to minimize MSE, depending on the input distribution. Hence, we conjecture that maximizing entropy is a good criterion for designing quantizers, even in the scalar case.

Appendix

Consider scalar quantization with the number of intervals k > 1. If one wishes to minimize MSE for a fixed k, the two necessary conditions (Lloyd-Max conditions) are obtained by differentiating equation 5.1 with respect to the boundary points \theta_i, 1 \le i < k, and with respect to the output levels y_i, 1 \le i \le k. The first corresponds to the nearest-neighbor condition; the second corresponds to the centroid condition.
Unsupervised Learning Rule for Scalar Quantization
We now show that entropy maximization is in general not equivalent to minimizing MSE for scalar quantization.
Proof. Assume the converse, that entropy maximization is in general equivalent to minimizing MSE. We start with the centroid condition. Taking the derivative of equation 5.1 with respect to y_i yields

∫_{θ_{i-1}}^{θ_i} (x - y_i) p(x) dx = 0

and, thus,

y_i = k ∫_{θ_{i-1}}^{θ_i} x p(x) dx   (A.1)
since we have an equiprobable quantization. Consider two consecutive intervals, D_i and D_{i+1}, 1 ≤ i < k. Using the previous equation, the average of the corresponding output levels equals

(y_i + y_{i+1}) / 2 = (k/2) ∫_{θ_{i-1}}^{θ_{i+1}} x p(x) dx   (A.2)

This equation then simply states that (y_i + y_{i+1})/2 equals the centroid of the intervals D_i and D_{i+1}. However, following the nearest-neighbor condition, we must have (y_i + y_{i+1})/2 = θ_i, 1 ≤ i < k. The question is, does θ_i correspond to the centroid of the intervals D_i and D_{i+1}? Since we have an equiprobable quantization, θ_i is the median of these intervals. Now since centroids do not coincide with medians in general, the original assumption is false. Hence, entropy maximization is in general not equivalent to MSE minimization. □
Acknowledgments

The authors wish to thank Prof. L. Xu, Peking University, Department of Mathematics, and Prof. M. Jordan, Massachusetts Institute of Technology, Department of Brain and Cognitive Sciences, for helpful discussions. The first author is a senior research assistant of the National Fund for Scientific Research (Belgium). He is also supported by a Fulbright-Hays grant-in-aid and a NATO research grant. The second author is supported by an INRIA Postdoctoral Fellowship (France).
References

Ahalt, S. C., Krishnamurthy, A. K., Chen, P., and Melton, D. E. 1990. Competitive learning algorithms for vector quantization. Neural Networks 3, 277-290.
DeSieno, D. 1988. Adding a conscience to competitive learning. In Proc. 1988 International Conference on Neural Networks (ICNN-88), San Diego, Vol. I, 117-124.
Estève, D., Baillieu, F., and Delapierre, G. 1992. Integrated silicon-based sensors: Basic research activities in France. Sensors and Actuators A 33, 1-4.
Marc M. Van Hulle and Dominique Martinez
Gersho, A., and Gray, R. M. 1991. Vector Quantization and Signal Compression. Kluwer, Boston.
Grossberg, S. 1976a. Adaptive pattern classification and universal recoding: I. Parallel development and coding of neural feature detectors. Biol. Cybern. 23, 121-134.
Grossberg, S. 1976b. Adaptive pattern classification and universal recoding: II. Feedback, expectation, olfaction, illusions. Biol. Cybern. 23, 187-202.
Hertz, J., Krogh, A., and Palmer, R. G. 1991. Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA.
Kohonen, T. 1989. Self-Organization and Associative Memory. Springer, Berlin.
Kohonen, T. 1990. Learning vector quantization and the self-organising map. In Theory and Applications of Neural Networks, J. G. Taylor and C. L. T. Mannion, eds., pp. 235-242. Springer, Berlin.
Lee, B. W., and Sheu, B. J. 1989. Design of a neural-based A/D converter using modified Hopfield network. IEEE J. Solid-State Circuits SC-24, 1129-1135.
Lee, B. W., and Sheu, B. J. 1992. Design and analysis of analog VLSI neural networks. In Neural Networks for Signal Processing, B. Kosko, ed., pp. 229-286. Prentice Hall, Englewood Cliffs.
Linde, Y., Buzo, A., and Gray, R. M. 1980. An algorithm for vector quantizer design. IEEE Trans. Commun. COM-28, 84-95.
Max, J. 1960. Quantizing for minimum distortion. IRE Trans. Inform. Theory IT-6, 7-12.
Michel, A. N., and Gray, D. L. 1990. Analysis and synthesis of neural networks with lower block triangular interconnecting structure. IEEE Trans. Circuits Syst. 37, 1267-1283.
Mokkadem, A. 1989. Estimation of the entropy and information of absolutely continuous random variables. IEEE Trans. Inform. Theory 35, 193-196.
Nasrabadi, N. M., and Feng, Y. 1988. Vector quantization of images based upon the Kohonen self-organizing feature maps. In IEEE International Conference on Neural Networks, pp. 1101-1108. IEEE, San Diego.
Naylor, J., and Li, K. P. 1988. Analysis of a neural network algorithm for vector quantization in speech parameters. In Proceedings of the First Annual INNS Meeting, p. 310. Pergamon Press, New York.
Ogunfunmi, A. O., and Wadhwa, S. K. 1992. New architectures for the A/D converter application of the Hopfield neural network. In Intelligent Engineering Systems through Artificial Neural Networks (Proc. of the Conf. on Artificial Neural Networks in Engineering, St. Louis, 1992), C. H. Dagli, S. R. T. Kumara, and Y. C. Shin, eds., pp. 47-52. ASME Press Series on International Advances in Design Productivity.
Ritter, H., and Schulten, K. 1986. On the stationary state of Kohonen's self-organizing sensory mapping. Biol. Cybern. 54, 99-106.
Rumelhart, D. E., and Zipser, D. 1985. Feature discovery by competitive learning. Cog. Sci. 9, 75-112.
Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Tank, D. W., and Hopfield, J. J. 1986. Simple "neural" optimization networks: An A/D converter, signal decision circuit, and a linear programming circuit. IEEE Trans. Circuits Syst. 33, 533-541.
Van den Bout, D. E., and Miller, T. K., III. 1989. TInMANN: The integer Markovian artificial neural network. In Proc. International Joint Conference on Neural Networks, pp. II-205-II-211. Erlbaum, Englewood Cliffs, NJ.
Van Hulle, M. M., and Martinez, D. 1993. On a novel unsupervised competitive learning algorithm for scalar quantization. IEEE Transactions on Neural Networks, in press.
Wu, X. 1992. On convergence of Lloyd's Method I. IEEE Trans. Inform. Theory 38, 171-174.

Received 9 October 1992; accepted 4 March 1993.
Communicated by John Platt
A Function Estimation Approach to Sequential Learning with Neural Networks

Visakan Kadirkamanathan*
Department of Engineering, University of Cambridge, UK

Mahesan Niranjan
Department of Engineering, University of Cambridge, UK
In this paper, we investigate the problem of optimal sequential learning, viewed as a problem of estimating an underlying function sequentially rather than estimating a set of parameters of the neural network. First, we arrive at a suboptimal solution to the sequential estimate that can be mapped by a growing gaussian radial basis function (GaRBF) network. This network adds hidden units for each observation. The function space approach, in which the estimates are represented as vectors in a function space, is used in developing a growth criterion to limit its growth. A simplification of the criterion leads to two joint criteria, on the distance of the present pattern from the existing unit centers in the input space and on the approximation error of the network for the given observation, to be satisfied together. This network is similar to the resource allocating network (RAN) (Platt 1991a), and hence the RAN can be interpreted from a function space approach to sequential learning. Second, we present an enhancement to the RAN. The RAN either allocates a new unit based on the novelty of an observation or adapts the network parameters by the LMS algorithm. The function space interpretation of the RAN lends itself to an enhancement in which the extended Kalman filter (EKF) algorithm is used in place of the LMS algorithm. The performances of the RAN and the enhanced network are compared in the experimental tasks of function approximation and time-series prediction, demonstrating the superior performance of the enhanced network with a smaller number of hidden units. The approach adopted here has led us toward the minimal network required for a sequential learning problem.

1 Introduction
Artificial neural networks (ANNs) provide an input-output mapping and hence their output can be written as a function of the inputs and of

*Present address: Department of Automatic Control and Systems Engineering, University of Sheffield, UK.
Neural Computation 5, 954-975 (1993) © 1993 Massachusetts Institute of Technology
the parameters or weights in the network. Learning in neural networks amounts to the approximation of an underlying function, which in turn reduces to the estimation of the parameters (or weights) that is optimal in some sense, such as least squared approximation error. The conventional approaches to sequential learning view the problem in the parameter space, that is, as estimation of a set of parameters. Such an approach requires the complexity or size of the ANN to be specified a priori. We have adopted an alternative approach of estimating a function, in which we view the problem in a function space (the infinite dimensional space of square integrable real functions) where ANN mappings of different complexity can be represented. As we shall see later, the real advantage is demonstrated by the development of a growing network from this approach.

The function estimation approach to sequential learning, in which the task is to estimate an underlying function sequentially, led to the development of the principle of F-Projection (Kadirkamanathan 1991; Kadirkamanathan and Fallside 1990; Kadirkamanathan et al. 1991). The principle is used in deriving a sequential estimate that is mapped by a gaussian radial basis function (GaRBF) network that adds a hidden unit for each observation. This method of estimation may be looked on as a sequential method of kernel-based nonparametric estimation, such as the Parzen window density estimation (Duda and Hart 1973). It differs from the statistical approach adopted by White (1990), who specifies the rate of growth of the number of hidden units in the ANN based only on the number of observations. We also demonstrate how the function space approach, where the function estimates are analyzed in the space of all square integrable functions, leads to a criterion to limit the growth of the network. The resulting network is a GaRBF network that adds a hidden unit subject to the present observation satisfying some growth criteria.
It is equivalent to the resource allocating network (RAN) (Platt 1991a) but for the manner in which it is derived. Platt (1991a) describes the RAN as a single hidden layer network of locally tuned hidden units whose responses are linearly combined to form an output response. It is essentially a GaRBF network. However, the RAN starts with no hidden units and grows by allocating hidden units based on the "novelty" of an observation. Since the novelty of each observation is tested, it is ideally suited for sequential learning problems such as on-line prediction and control. The objective behind its development is to gradually approach the appropriate complexity of the network that is sufficient to provide an approximation to an underlying mapping that is consistent with the observations being received. The RAN may be viewed as an extension of the restricted Coulomb energy (RCE) model (Reilly et al. 1992) of classification to solving the function interpolation problem. When the novelty or the growth criterion is not satisfied, the existing RAN parameters are adapted by the LMS algorithm. While the growth
criterion and the allocation of a new hidden unit can be explained from the function space approach, the adaptation by LMS seems to be a weak step in an otherwise optimal procedure. The RAN can be enhanced by adopting the function space approach in the adaptation stage as well. The enhancement we suggest is to use the extended Kalman filter (EKF) algorithm in place of the LMS algorithm. The performances of the RAN and the enhanced RAN are compared in two experiments.

The organization of the paper is as follows: The next section provides a brief description of the preliminary concepts and the notations used in the paper. Section 3 introduces the function space approach to sequential learning. Section 4 develops the growth criterion for the network, and in Section 5, Platt's description of the RAN is given along with a discussion of its equivalence to the network derived from the function space approach. Section 6 contains the description of the enhanced RAN. The experimental results are given in Section 7, followed by conclusions in Section 8.

2 Preliminaries and Notations
The ANN learns the mapping from a set of data in the form of input-output observation pairs (x_n, y_n), where x_n is an M-dimensional input vector and y_n is an output scalar. The input lies in a subset V of the space of all real valued M-dimensional vectors R^M. The nth observation can then be described as

I^(n) = {(x_n, y_n) : x_n ∈ V ⊂ R^M; y_n ∈ R}   (2.1)
The observations I^(n), n = 1, ..., N are assumed to be free of noise and consistent with an underlying function f*, viz.,

f*(x_n) = y_n  for  n = 1, ..., N   (2.2)
The mapping described by the ANN is denoted by f(x), with a shorthand description of f. Hence, f : x → y (V → R). The closeness between the ANN mapping and the underlying function is measured by some distance metric D(f, f*). A common and popular metric used with ANNs is the L2-norm, given by

D(f, f*) = ||f - f*||   (2.3)

where || · || denotes the L2-norm. The squared L2-norm¹ is given by

||f - f*||² = ∫_V |f(x) - f*(x)|² dx   (2.4)

¹In general a weighting function w(x) is used inside the integral. If the distribution of the past observations is not known a priori, w(x) is taken to be uniform in V, giving the expression in equation 2.4.
| · | gives the absolute value of its argument. Note that the L2-norm of an M-dimensional vector a is also denoted by ||a||. The L2-norm describes a function space that contains all the square integrable real valued functions. Since an inner product can also be defined in this space, it is a Hilbert space, denoted by

H = {f : ||f|| < ∞}   (2.5)

The mapping described by an ANN satisfies the above requirement in general, and hence all possible functions an ANN can describe lie in the function space H. The inner product between two functions f, g ∈ H is given by

(f, g) = ∫_V f(x) g(x) dx   (2.6)

The concepts from geometry can be applied in a Hilbert space, which leads us to the notion of the angle between the two functions f and g, given by

Ω = cos⁻¹ { (f, g) / (||f|| ||g||) }   (2.7)
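The quantities of equations 2.3-2.7 can be illustrated with a short numeric sketch. The grid size, the test functions, and all names below are our own choices, with the integrals over V = [0, 1] approximated by sums:

```python
import math

# Numerical illustration of the function-space quantities of this section:
# the L2-norm (equations 2.3-2.4), the inner product (equation 2.6), and the
# angle Omega between two functions (equation 2.7), with integrals over
# V = [0, 1] approximated by a midpoint sum on a fine grid.

N = 10000
xs = [(j + 0.5) / N for j in range(N)]   # midpoint grid on V = [0, 1]

def inner(f, g):
    # Inner product (f, g) = int_V f(x) g(x) dx (equation 2.6).
    return sum(f(x) * g(x) for x in xs) / N

def norm(f):
    # L2-norm ||f|| = sqrt((f, f)).
    return math.sqrt(inner(f, f))

def angle(f, g):
    # Omega = arccos((f, g) / (||f|| ||g||)) (equation 2.7).
    return math.acos(inner(f, g) / (norm(f) * norm(g)))

f = lambda x: math.sin(math.pi * x)
g = lambda x: math.cos(math.pi * x)

print(norm(f))        # ||f|| = sqrt(1/2), about 0.7071
print(angle(f, g))    # f and g are orthogonal on [0, 1]: Omega is about pi/2
```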
When the output of the network is not limited to the interval (0, 1), the hidden-output layer transformation is linear. Then the single hidden layer ANN linearly combines the outputs of the hidden units. Each of the hidden units constructs a mapping, and hence these mappings can be viewed as the basis functions φ_k ∈ H, k = 1, ..., K, the total number of hidden units being K. The output is thus represented as

f(x) = Σ_{k=1}^{K} α_k φ_k(x)   (2.8)
In sequential estimation, an estimate is required at each time instant n. The ANN mapping after it has learned from the nth observation I^(n) is denoted by f^(n). This is known as the posterior estimate of the underlying function, and f^(n-1) as the prior.

3 Sequential Function Estimation
The sequential function estimation problem can be stated as follows: Given the prior estimate f^(n-1) and the new observation I^(n), how do we combine these in obtaining the posterior estimate f^(n)? Given only the information above and the assumption that the observations are free of noise, one approach to sequential estimation is to choose an optimal estimate at each step. Any improvement over this solution would require additional memory, such as the probability density of the past observations. The step-wise optimal estimate is given by
the principle of F-Projection (Kadirkamanathan and Fallside 1990), which states,

f^(n) = arg min_{f ∈ H_n} ||f - f^(n-1)||   (3.1)

where H_n is the set consisting of all the functions in H that satisfy the constraint f(x_n) = y_n. The posterior is a projection of the prior onto the space H_n. The principle is an analogue of the projection algorithm for linear models (Goodwin and Sin 1984), where the prior parameter vector is projected onto the constraint hyperplane in the parameter space. The equality constraint f(x_n) = y_n can be rewritten as an inner product in the function space,

(f, δ_n) = y_n   (3.2)
where δ_n = δ(x - x_n) is the impulse function.² The constrained minimization can be solved exactly to give

f^(n) = f^(n-1) + e_n h_n   (3.3)

where e_n is the prediction error, given by

e_n = y_n - f^(n-1)(x_n)   (3.4)

and

h_n(x) = δ_n(x) / δ_n(x_n)   (3.5)

This solution amounts to adding a spike at the point x_n to f^(n-1)(x) such that f^(n)(x) goes through the point (x_n, y_n). Such a solution discounts the fact that the underlying function is smooth and an observation has a bearing on its neighborhood in the input space V. Smoothness constraints must then be added to obtain a posterior estimate.³ The smoothness constraint must be imposed on h_n, which has the following properties: h_n(x_n) = 1 and h_n(x_n + a) = 0 for any ||a|| ≠ 0. Smoothing this impulse-like function, subject to the constraint that f^(n)(x_n) = y_n, yields the gaussian RBF φ_n, given by

φ_n(x) = exp(-||x - u_n||² / (2σ_n²))   (3.6)

²The δ_n does not have a finite L2-norm and hence δ_n ∉ H. However, a function such as a rectangular function that approaches the impulse function in the limit can be used.
³In the case of networks with a finite number of hidden units, this exact solution cannot be reached. In fact, the smoothness constraint is implicit in the selection of the basis functions, which in turn ensures that the posterior estimate obtained from the algorithm based on the principle of F-Projection is sufficiently smooth (Kadirkamanathan et al. 1991).
with u_n = x_n and σ_n representing the required smoothness.⁴ Now the properties of φ_n are φ_n(x_n) = 1 and φ_n(x_n + a) → 0 as ||a|| → ∞. The parameter σ_n is the spread of the GaRBF, representing its span around x_n in the input space. This view is similar to the method of potential functions (Duda and Hart 1973), where each observation in the input space contributes to its neighborhood via the potential of a charge placed on the observation, the span signifying the region of influence of the charge. Hence, from the principle of F-Projection and smoothing its solution, we have arrived at the posterior function estimate f^(n), given by

f^(n)(x) = f^(n-1)(x) + e_n φ_n(x)   (3.7)

Let us use the GaRBF network to map the function estimate f^(n-1). Assume there are K hidden units (basis functions) in the network that maps f^(n-1). Then the posterior is given by

f^(n)(x) = Σ_{k=1}^{K} α_k φ_k(x) + e_n φ_n(x)   (3.8)
         = Σ_{k=1}^{K+1} α_k φ_k(x)   (3.9)

The posterior estimate is mapped by the same GaRBF network with a new hidden unit added, and the parameters associated with it are assigned as follows:

α_{K+1} = e_n   (3.10)
u_{K+1} = x_n   (3.11)
σ_{K+1} = σ_n   (3.12)
Figure 1 shows the architecture of the network in which a hidden unit is added to map f^(n). For the moment, we shall assume that we are given the value of σ_n. In the next section we shall see how a reasonable value can be assigned. The network we have arrived at grows with each new observation. The observations x_n are implicitly stored as the centers of the gaussian hidden units, and the e_n (hence y_n) are implicit in their coefficients. This estimate is similar in spirit to the Parzen window density estimation procedure, where the number of kernels is the same as the number of observations and the kernels are centered on the input observations (Duda and Hart 1973). The difficulty with using this network for estimation is that the network grows indefinitely as the observations are continually received.

⁴The maximum curvature is given by sup |φ″(x)| = 1/σ², where the double prime denotes the second derivative with respect to x.
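Under our reading of equations 3.6-3.12, the growing estimator can be sketched as follows. This is an illustrative toy (scalar inputs, a fixed width σ), not the authors' implementation:

```python
import math

# A minimal sketch of the growing GaRBF estimator of this section: every
# observation (x_n, y_n) adds one gaussian unit centered on x_n (eq 3.11)
# with coefficient e_n, the prediction error (eq 3.10). Scalar inputs and a
# fixed width sigma are simplifying assumptions made here for brevity.

class GrowingGaRBF:
    def __init__(self, sigma=0.3):
        self.sigma = sigma
        self.centers = []        # u_k, the stored observation locations
        self.coeffs = []         # alpha_k = e_k, the prediction errors

    def phi(self, x, u):
        # Gaussian basis function, our reading of equation 3.6.
        return math.exp(-((x - u) ** 2) / (2.0 * self.sigma ** 2))

    def predict(self, x):
        # f(x) = sum_k alpha_k phi_k(x)
        return sum(a * self.phi(x, u) for a, u in zip(self.coeffs, self.centers))

    def observe(self, x, y):
        e = y - self.predict(x)          # prediction error (equation 3.4)
        self.coeffs.append(e)            # alpha_{K+1} = e_n
        self.centers.append(x)           # u_{K+1} = x_n
        return e

net = GrowingGaRBF()
for x in [0.0, 0.25, 0.5, 0.75, 1.0]:
    net.observe(x, math.sin(2 * math.pi * x))

# The most recent observation is interpolated exactly (up to rounding),
# since its unit contributes e_n * phi_n(x_n) = e_n at its own center:
print(abs(net.predict(1.0) - math.sin(2 * math.pi * 1.0)))
```

Note how one unit per observation illustrates the indefinite growth discussed above: five observations already produce five hidden units.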
Figure 1: The network architecture of the growing GaRBF network. The dotted lines show the new links formed by the addition of a hidden unit.

4 A Geometric Growth Criterion
In problems where the data are received sequentially, the network may have approximated the underlying function to a sufficient accuracy and may go on to add hidden units that contribute little to the final estimate. Not only does the complexity of the network increase unnecessarily, it adds significantly to the computational burden. Furthermore, if the data were noisy, good estimation of the network parameters requires their number to be much smaller than the number of data from which they were estimated. This leads us to the question of how the network growth must be limited. In this section, we will derive a growth criterion using a function space approach that leads us to the RAN's growth criteria.

Consider the Hilbert space H. The network solutions are points in this infinite dimensional space. If the network consists of K hidden units (and hence K basis functions), assuming that the parameters of the basis functions are not adapted, the network solutions lie in a K-dimensional subspace H_K formed by the K basis functions. Figure 2 gives a three-dimensional illustration of the network solutions. The prior estimate f^(n-1) and the posterior estimate obtained with the existing K basis functions, f_K^(n), both lie in H_K. The posterior estimate f^(n) (given in equation 3.7) is obtained by adding a new basis function φ_n. Note that f_K^(n) is the projection of f^(n) onto H_K and hence is the closest point to f^(n) in H_K. The distance between f^(n) and f_K^(n) is given by ||f^(n) - f_K^(n)||. This distance represents a measure of how bad our approximation will be if we do not add a new basis function. Hence, the decision to add a hidden unit can be based on this distance exceeding a threshold. From
Figure 2: Three-dimensional illustration of the prior and posterior network solutions in the Hilbert space.

the geometry of the network solutions shown in Figure 2, the criterion is stated as

||f^(n) - f_K^(n)|| = |e_n| ||φ_n|| sin(Ω) > ε   (4.1)

where ε is a threshold and Ω is the angle formed by the new basis function φ_n to the subspace H_K defined by the K basis functions in f^(n-1). The norm of the basis function φ_n depends only on the width σ_n. The angle lies between 0 and π/2 and therefore 0 ≤ sin(Ω) ≤ 1. Note that such an approach can also be adopted for block learning problems in choosing the form of the next basis function to be added. The distance ||f^(n) - f_K^(n)|| may be evaluated directly, but it is computationally intensive. Hence, the geometric criterion is simplified further as follows:

|e_n| > e_min   (4.2)
Ω > Ω_min   (4.3)

assuming that σ_n is predetermined, in which case the growth criterion depends only on e_n and Ω. These criteria are referred to as the prediction error criterion and the angle criterion, respectively. The prediction error criterion checks for the interpolation of the present observation by the network. The angle criterion attempts to assign basis functions that are nearly orthogonal to all other existing basis functions.⁵

⁵No two GaRBFs are completely orthogonal to each other, except in the limit of the widths σ_k approaching 0 or ∞. No single GaRBF with unique parameters is a linear combination of any other GaRBFs, except in the limit of an infinite number of GaRBFs.
The angle Ω is difficult to evaluate in general, and an approximation of the angle criterion is for the smallest angle between the new basis function and all other existing basis functions to exceed a threshold. The angle between two GaRBFs φ_k and φ_n with the same width σ_k = σ_n = σ_0 is given by [in (Kadirkamanathan 1991)]

cos²(Ω) = φ_k(u_n) = exp(-||u_k - u_n||² / (2σ_0²))   (4.4)

The angle criterion then reduces to

sup_k φ_k(x_n) ≤ cos²(Ω_min)   (4.5)

a threshold on the output of the basis functions to the input x_n. This can equivalently be expressed as

inf_k ||x_n - u_k|| ≥ ε_n   (4.6)

a threshold on the distance between the input x_n and the nearest GaRBF unit center u_k, with

ε_n = σ_0 √(2 log(1 / cos²(Ω_min)))   (4.7)
Even when the widths σ_n are not equal, a similar criterion can be arrived at (Kadirkamanathan 1991). From equation 4.4 it is clear that the angle can be increased by lowering σ_n. However, lowering σ_n increases the curvature of φ_n(x), which in turn gives a less smooth posterior estimate. A good choice for σ_n then is for it to be as large as possible while still satisfying the angle criterion. From equation 4.5 this turns out to be

σ_n = κ ||x_n - u_nr||   (4.8)

where

u_nr = arg min_{u_k} ||x_n - u_k||   (4.9)

is the nearest GaRBF unit center to x_n, and

κ = 1 / √(2 log(1 / cos²(Ω_min)))   (4.10)

If Ω_min is decreased, allowing more overlap between the two basis functions, κ is increased. The addition of a basis function centered on the input pattern has an analogy to placing marbles inside a restricted space such as a cube. The minimum distance criterion ensures that the marbles are of a particular radius. Irrespective of the distribution of the input patterns, there is a limit on the number of such marbles that can be placed inside a finite volume cube. The network now adds a hidden unit only if the prediction error criterion (equation 4.2) and the distance criterion (equation 4.6) are both satisfied. These growth criteria are the same as those for the RAN. What is different is how they are arrived at in this paper, where a function space approach is adopted.
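As a consistency check on equations 4.5-4.10, under the gaussian form exp(-||x - u||²/(2σ²)) used in our reconstruction of equation 3.6, the distance threshold ε_n and the overlap factor κ can be computed from an assumed Ω_min:

```python
import math

# Relations between the angle threshold and the distance/width quantities of
# equations 4.7, 4.8, and 4.10, for an assumed minimum angle Omega_min. With
# phi(x) = exp(-||x - u||^2 / (2 sigma^2)), the angle criterion (4.5) is the
# distance criterion (4.6) in disguise.

omega_min = math.pi / 3                       # assumed minimum angle
c2 = math.cos(omega_min) ** 2                 # cos^2(Omega_min)

kappa = 1.0 / math.sqrt(2.0 * math.log(1.0 / c2))    # equation 4.10
sigma0 = 0.5                                         # an assumed common width
eps = sigma0 * math.sqrt(2.0 * math.log(1.0 / c2))   # equation 4.7

# A unit center exactly eps away from x_n sits right on the angle threshold:
phi_at_eps = math.exp(-eps ** 2 / (2.0 * sigma0 ** 2))
print(phi_at_eps, c2)      # equal: phi_k(x_n) = cos^2(Omega_min)

# And sigma = kappa * eps recovers sigma0 (equations 4.7 and 4.10 are inverses):
print(kappa * eps)         # equals sigma0
```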
5 The Resource Allocating Network
The resource allocating network (RAN) was developed as a means to overcome the problem of NP-completeness in learning with fixed size networks (Platt 1991a). Its motivation was the fact that, by allocating new resources, learning could be achieved in polynomial time. Platt views the task of the RAN as combining memorization with adaptation (Platt 1991b), in which memorization is achieved by storing the input-output observations, as in Parzen window and k-nearest-neighbor methods. He improves on these methods by storing fewer observations, the number of which grows sublinearly and eventually saturates. The RAN finds an appropriate network (or size) for interpolating the given data, whereas in using a fixed size network either a smaller network that does not interpolate well or a larger network that overfits and generalizes poorly could be encountered. The RAN is a single hidden layer network whose output response to an input pattern is a linear combination of the hidden unit responses, given by
f(x) = α_0 + Σ_{k=1}^{K} α_k φ_k(x)   (5.1)

where φ_k(x) are the responses of the hidden units to an input x. The coefficients α_1, ..., α_K are the weights of the hidden to output layer, and α_0 is the bias term. The RAN hidden unit responses are given by

φ_k(x) = exp(-||x - u_k||² / (2σ_k²))   (5.2)

where u_k is the unit center or mean of the gaussian and σ_k is the spread of the neighborhood or width of the gaussian. The network is essentially a GaRBF network, except for the term α_0. Platt describes the operation of a hidden unit as storing a local region in the input space, the neighborhood of u_k. Hence, the u_k are viewed as stored patterns. The weights of the hidden-output layer, the coefficients α_k, define the contribution of each hidden unit to a particular output.

The network begins with no hidden units. The first observation (x_0, y_0), where y_0 is the target output, is used in initializing the coefficient α_0 = y_0. As observations are received, the network grows by storing some of them by adding new hidden units. The decision to store an observation (x_n, y_n) depends on its novelty, for which the following two conditions must be met:

||x_n - u_nr|| > ε_n   (5.3)
|y_n - f(x_n)| > e_min   (5.4)

where u_nr is the nearest stored pattern to x_n in the input space and ε_n, e_min are thresholds. The first criterion says that the input must be far away
from stored patterns, and the second criterion says that the error in the network output with respect to the target must be significant. The value e_min is chosen to represent the desired accuracy of the network output. The distance ε_n represents the scale of resolution in the input space. When a new hidden unit is added to the network, the parameters or weights associated with this unit are assigned as follows:

α_{K+1} = e_n   (5.5)
u_{K+1} = x_n   (5.6)
σ_{K+1} = κ ||x_n - u_nr||   (5.7)

where κ is an overlap factor that determines the overlap of the responses of the hidden units in the input space. The value for the width σ_{K+1} is based on a nearest-neighbor heuristic. When the observation (x_n, y_n) does not satisfy the novelty criteria, the LMS algorithm is used to adapt the network parameters w = [α_0, α_1, ..., α_K, u_1^T, ..., u_K^T]^T, given by

w^(n) = w^(n-1) + η e_n a_n   (5.8)

where η is the adaptation step size and a_n = ∇_w f(x_n) is the gradient of the function f(x) with respect to the parameter vector w evaluated at w^(n-1). Hence,

a_n = [1, φ_1(x_n), ..., φ_K(x_n), φ_1(x_n) (α_1/σ_1²)(x_n - u_1)^T, ..., φ_K(x_n) (α_K/σ_K²)(x_n - u_K)^T]^T   (5.9)

The RAN begins with ε_n = ε_max, the largest scale of interest, typically the size of the entire input space of nonzero probability density. The distance ε_n is decayed exponentially as

ε_n = max{ε_max γ^n, ε_min}   (5.10)

where 0 < γ < 1 is a decay constant. The value for ε_n is decayed until it reaches ε_min. Platt showed that the complexity of the RAN was smaller than that of fixed size networks in achieving a given degree of approximation (Platt 1991a). The advantages of the RAN are that it learns quickly and accurately, and forms a compact representation. However, the growth pattern of the RAN depends critically on γ, which influences the rate of growth, and on e_min, which, together with ε_min, determines the final size of the network. These parameters have to be chosen a priori, and hence the performance of the RAN depends crucially on their appropriate selection. The effect of using the LMS algorithm for adaptation is likely to result in slower convergence than if, say, an algorithm that attempts to obtain an optimal sequential estimate is used.
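A minimal sketch of the RAN loop described by equations 5.3-5.10 follows. It is our own simplification (scalar inputs, LMS applied to the coefficients only, illustrative parameter values), not Platt's implementation:

```python
import math
import random

class RAN:
    # Minimal resource allocating network: gaussian units, novelty-based
    # allocation (equations 5.3 and 5.4), exponentially decayed distance
    # threshold (equation 5.10), and LMS adaptation of the coefficients only
    # (the full RAN also moves the unit centers).
    def __init__(self, eps_max=1.0, eps_min=0.1, gamma=0.97,
                 e_min=0.02, kappa=0.87, eta=0.05):
        self.eps_max, self.eps_min, self.gamma = eps_max, eps_min, gamma
        self.e_min, self.kappa, self.eta = e_min, kappa, eta
        self.alpha0 = 0.0                      # bias term
        self.centers, self.widths, self.coeffs = [], [], []
        self.n = 0

    def phi(self, x, k):
        u, s = self.centers[k], self.widths[k]
        return math.exp(-((x - u) ** 2) / (2.0 * s ** 2))

    def predict(self, x):
        return self.alpha0 + sum(self.coeffs[k] * self.phi(x, k)
                                 for k in range(len(self.centers)))

    def observe(self, x, y):
        self.n += 1
        eps_n = max(self.eps_max * self.gamma ** self.n, self.eps_min)  # eq 5.10
        if self.n == 1:
            self.alpha0 = y                    # first observation sets the bias
            return
        e = y - self.predict(x)
        d = min((abs(x - u) for u in self.centers), default=float("inf"))
        if d > eps_n and abs(e) > self.e_min:  # novelty criteria (5.3), (5.4)
            self.coeffs.append(e)              # alpha_{K+1} = e_n  (eq 5.5)
            self.centers.append(x)             # u_{K+1} = x_n      (eq 5.6)
            width = self.kappa * (d if self.centers[:-1] else eps_n)
            self.widths.append(width)          # eq 5.7 (eps_n if no neighbor yet)
            return
        self.alpha0 += self.eta * e            # LMS step (eq 5.8), coefficients only
        for k in range(len(self.coeffs)):
            self.coeffs[k] += self.eta * e * self.phi(x, k)

random.seed(0)
ran = RAN()
for _ in range(500):
    x = random.random()
    ran.observe(x, math.sin(2.0 * math.pi * x))
print(len(ran.centers))   # far fewer units than the 500 observations
```

Because allocation requires the input to be at least ε_min away from every stored center, the number of units saturates on a bounded input range, illustrating the sublinear growth discussed above.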
The RAN is described mathematically as follows:

• The network output f(x) to the input x is given by

  f(x) = α_0 + Σ_{k=1}^{K} α_k φ_k(x)
The description of the RAN, shown above, has mostly been derived from the function space approach in Sections 3 and 4. There are differences, however, between this derivation and the RAN as specified by Platt. First, the term α_0 does not appear in our solution. This difference may be neglected in view of the universal approximation properties of the ANN. Second, and importantly, the threshold on the distance criterion of the RAN, ε_n, is reduced gradually until it reaches a minimum allowed value. The distance criterion ε_n provides a lower bound on the width σ_n (from equations 4.6 and 4.8), σ_n > κ ε_n. The lower bound ε_min on ε_n then gives

σ_k > κ ε_min   (5.11)

a lower bound on the width of all the basis functions φ_k. This ensures a limit on the smoothness of the basis functions, preventing a noisy fit to the data. The exponential decaying of the distance criterion allows fewer basis functions with large widths (smoother basis functions) initially; with an increasing number of observations, more basis functions with smaller
widths are allocated to fine tune the approximation. Since κ is fixed, the minimum angle Ω_min also remains unchanged, and hence the near orthogonality condition is maintained. However, the lowering of ε_n is possible only with the simultaneous lowering of σ_n achieved by equation 4.8. Finally, we have not discussed how to adapt the parameters of the network when a hidden unit is not added. The RAN adapts the coefficients α_k and the hidden unit centers u_k when it decides not to add a hidden unit. The adaptation of the parameters u_k amounts to the rotation of the subspace H_K. The function space interpretation given to the architecture and the growth criteria of the RAN suggests the use of an algorithm based on the same approach, the F-Projections algorithm (Kadirkamanathan 1991; Kadirkamanathan et al. 1991). We shall see in the next section how this leads to an enhancement of the RAN.
6 An Enhanced RAN
For the sequential learning problem, the principle of F-Projection gives the optimal posterior estimate of an underlying function, given its prior estimate and a new observation (Kadirkamanathan 1991). An extension of the F-Projections algorithm is the recursive nonlinear least-squares (RNLS) algorithm (Kadirkamanathan 1991; Kadirkamanathan and Niranjan 1991) in which the distribution of the previous input patterns is also recursively estimated. The RNLS estimate is obtained by minimizing the cost function
J_n = [y_n - f^(n)(x_n)]^2 + ∫ [f^(n)(x) - f^(n-1)(x)]^2 p^(n-1)(x) dx    (6.1)
where p^(n-1)(x) is the probability distribution of the past (n - 1) input observations. Solving the above minimization is numerically intensive, but it can be approximated to give the well-known extended Kalman filter (EKF)^6 algorithm (Kadirkamanathan 1991). Having shown that the principle of F-Projection formed the basis for the RAN, the enhancement we suggest here is to use the EKF algorithm in place of the LMS algorithm. This enhancement, first proposed in Kadirkamanathan (1991) and used in Kadirkamanathan et al. (1992), improves the rate of convergence of the RAN and results in a network with smaller complexity. We shall refer to this enhanced network as RAN-EKF. A similar approach was (independently) proposed in Azimi-Sadjadi and Sheedvash (1991), where they use the RLS algorithm for the growing multilayer perceptron (or backpropagation) network.

^6 For linear stationary models, the EKF algorithm is equivalent to the weighted recursive least-squares (RLS) algorithm (Candy 1986).
Sequential Learning with Neural Networks
Given a parameter vector w, the EKF algorithm obtains the posterior estimate w^(n) from its prior estimate w^(n-1) and its prior error covariance estimate P_{n-1} as follows [see Candy (1986)]:
w^(n) = w^(n-1) + e_n k_n    (6.2)
where k_n is the Kalman gain vector given by
k_n = [R_n + a_n^T P_{n-1} a_n]^{-1} P_{n-1} a_n    (6.3)
where a_n is the gradient vector and R_n is the variance of the measurement noise. The error covariance matrix is updated by
P_n = [I - k_n a_n^T] P_{n-1}    (6.4)
I being the identity matrix. Note that we are now adapting the width parameters σ_1, . . . , σ_K as well, and hence these are included in the parameter vector w. The rapid convergence of the EKF algorithm may prevent the model from adapting to future data. To avoid this problem, a random walk model is often used (Young 1984), where the covariance matrix update becomes

P_n = [I - k_n a_n^T] P_{n-1} + Q_0 I    (6.5)
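Equations 6.2-6.5 can be implemented directly for a scalar-output model. The sketch below assumes a symmetric covariance P (so that a^T P equals (Pa)^T) and uses plain Python lists; it is an illustration of the update, not the authors' implementation.

```python
def ekf_step(w, P, a, e, R=1.0, Q0=0.02):
    # w: parameter vector, P: symmetric error covariance (list of lists),
    # a: gradient of the network output w.r.t. w,
    # e: prediction error y_n - f(x_n).
    p = len(w)
    Pa = [sum(P[i][j] * a[j] for j in range(p)) for i in range(p)]
    s = R + sum(a[i] * Pa[i] for i in range(p))     # scalar R_n + a^T P a
    k = [Pa[i] / s for i in range(p)]               # Kalman gain (6.3)
    w_new = [w[i] + e * k[i] for i in range(p)]     # parameter update (6.2)
    # Random-walk covariance update (6.5): (I - k a^T) P + Q0 I
    # (uses the symmetry of P: the row vector a^T P equals Pa transposed)
    P_new = [[P[i][j] - k[i] * Pa[j] + (Q0 if i == j else 0.0)
              for j in range(p)] for i in range(p)]
    return w_new, P_new
```

Setting Q0 = 0 recovers the plain update of equation 6.4; a small positive Q0 keeps the filter responsive to future data, as discussed next.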
The parameter Q_0 is a scalar that determines the allowed random step in the direction of the gradient vector. The error covariance matrix P_n is a P x P positive definite symmetric matrix, where P is the number of parameters being adapted. Whenever a new hidden unit is allocated, the dimensionality of P_n increases, and hence the new rows and columns must be initialized. Since P_n is an estimate of the error covariance of the parameters, we choose

P_n = [ P_{n-1}    0
          0      P_0 I ]

where P_0 is an estimate of the uncertainty in the initial values assigned to the parameters, which in our case is also the variance of the observations x_n and y_n. The dimension of the identity matrix I is equal to the number of new parameters introduced by the addition of a new hidden unit. The above equation combined with the EKF algorithm is used in place of the LMS algorithm in the RAN-EKF. The RAN, with its LMS adaptation, is considerably faster to implement than the relatively more complex (computationally) RAN-EKF. However, the EKF algorithm can be implemented as a fast transversal filter algorithm (Azimi-Sadjadi and Sheedvash 1991) to increase its speed and hence its capability to learn on-line in real time.
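The block-diagonal growth of the covariance matrix when a unit is allocated can be sketched as follows (the function name is ours):

```python
def extend_covariance(P, n_new, P0=1.0):
    # Keep the old covariance block; give the n_new parameters introduced
    # by the new hidden unit an initial covariance of P0 * I, with zero
    # cross-covariance to the existing parameters.
    p = len(P)
    grown = [row + [0.0] * n_new for row in P]
    for i in range(n_new):
        grown.append([0.0] * p + [P0 if j == i else 0.0 for j in range(n_new)])
    return grown
```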
7 Experimental Results
The RAN was developed as a model for solving sequential function interpolation problems. It is also suitable for off-line (or block) learning, where all the data are available en bloc and presented one by one. Two types of problems are presented to the networks. The first is a synthetic function approximation problem where the data are randomly sampled from an underlying function and presented one by one cyclically. The second is a problem of on-line prediction of a chaotic time-series where an underlying function does not exist. The variation of the performance measures is shown with increasing time in order to illustrate the effect of the evolving networks. The RAN used in these experiments did not have the α_0 coefficient and hence the networks differed only in the algorithm used for adaptation. The underlying function that needs to be approximated in the first experiment is the Hermite polynomial [used in MacKay (1992)],

f(x) = 1.1 (1 - x + 2x^2) exp{-x^2/2}    (7.1)
where x ∈ R. A random sampling of the interval [-4, +4] is used in obtaining the 40 input-output data for the training set. The approximation error is calculated from 200 uniformly sampled data in the same interval. The parameter values used in this experiment are as follows: ε_max = 2.0, ε_min = 0.2, γ = 0.977, e_min = 0.02, κ = 0.87, η = 0.05, P_0 = R_n = 1.0, and Q_0 = 0.02. The performances of the different RANs are shown in Figure 3. The RAN-EKF is the enhanced network and RAN-NO is a network that adds hidden units according to the growth criteria but does not adapt the parameters when a unit is not added. The growth patterns of all three networks were similar in the initial stages, suggesting that the rate of growth is independent of the adaptation procedure and depends only on the decay rate of ε_n. In the later stages, though, the faster convergence of the RAN-EKF prevents further addition of hidden units by reducing the interpolation errors below the critical value e_min. The approximation error, measured by the root mean squared error (RMSE) (in Figure 3B), clearly illustrates the faster convergence achieved by the enhanced RAN. The final RMSE value of around 0.07 for the RAN was achieved in just 80 iterations by the RAN-EKF with only 6 hidden units, compared to the 18 in the RAN. The enhanced network was not only the least complex, but was also more accurate by nearly an order of magnitude than the RAN. Note also that the performance of RAN-NO was comparable to the RAN, demonstrating the power behind
Figure 3: Facing page. The performance of the RAN versions on the Hermite function approximation problem. The circles represent the 40 input-output observations.
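The target function and sampling scheme of this experiment are easy to reproduce; a minimal sketch follows (the random seed is an arbitrary choice, not one from the paper):

```python
import math
import random

def hermite(x):
    # Hermite-polynomial target from MacKay (1992), equation 7.1.
    return 1.1 * (1.0 - x + 2.0 * x * x) * math.exp(-x * x / 2.0)

random.seed(0)
# 40 training points drawn uniformly at random from [-4, +4] ...
train = [(x, hermite(x)) for x in (random.uniform(-4.0, 4.0) for _ in range(40))]
# ... and 200 uniformly spaced test points for measuring the approximation error.
test_x = [-4.0 + 8.0 * i / 199.0 for i in range(200)]
```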
[Figure 3: (a) growth pattern (number of hidden units vs. number of observations); (b) RMS approximation error vs. number of observations; (c) the Hermite function and its approximations; curves for RAN, RAN-NO, and RAN-EKF.]
[Figure 4: (a) network size (hidden units) vs. noise variance; (b) RMS approximation error vs. noise variance; curves for RAN-NO, RAN, and RAN-EKF with e_min = 0.02 and 0.05.]
Figure 4: The effect of noise in observations for the approximation problem. The numbers inside the parentheses in the key are the values of e_min used.

the appropriate allocation of hidden units. The approximation shown in Figure 3c shows that the RAN does not interpolate as smoothly as the RAN-EKF does, as also observed in Kadirkamanathan (1991). The effects of noise were analyzed by adding varying levels of gaussian noise to the training patterns. The size of the network and the
approximation error for different noise levels for the RAN versions are shown in Figure 4. It is apparent from the figure that the value of e_min at the lower ranges, such as 0.02 and 0.05, chosen to represent the desired accuracy, does not affect the goodness of approximation for any level of noise. At low levels of noise, a higher threshold e_min results in a smaller network, the difference disappearing with increasing levels of noise. The RAN-EKF was able to consistently form a more compact network than the RAN while achieving better performance. At high noise levels, adapting the means u_k seemed not only to increase the size of the network but also resulted in poor approximation, as demonstrated by the performance of RAN-NO. In the second experiment, the chaotic time-series being predicted is the Mackey-Glass series, which is governed by the following differential delay equation:
ds(t)/dt = -b s(t) + a s(t - τ) / [1 + s(t - τ)^10]    (7.2)
with a = 0.2, b = 0.1, and τ = 17. A sampled version of the above series is obtained by integrating the equation, smoothing, and sampling it. The training data are obtained from a series of 5000 samples and the test data of 1000 samples from the same series, but into the future. The series is predicted v = 50 sample steps ahead using four past samples, namely s_n, s_{n-6}, s_{n-12}, s_{n-18}. Hence, the nth input-output data for the network to learn are

x_n = [s_{n-v}, s_{n-v-6}, s_{n-v-12}, s_{n-v-18}]^T    (7.3)

y_n = s_n    (7.4)

whereas the step-ahead predicted value at time (n) is given by

z_{n+v} = f^(n)(x_{n+v})    (7.5)

where f^(n) is the network at time (n). The step-ahead prediction error is given by

e_{n+v} = s_{n+v} - z_{n+v}    (7.6)
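The Mackey-Glass series of equation 7.2 can be generated by simple Euler integration with a delay buffer. The unit step size and constant initial history below are illustrative choices, not the authors' exact integration and smoothing procedure:

```python
def mackey_glass(n_samples, a=0.2, b=0.1, tau=17, dt=1.0, s0=1.2):
    # history holds s(t - tau) ... s(t); a constant initial history is assumed.
    history = [s0] * (tau + 1)
    series = []
    for _ in range(n_samples):
        s_t, s_lag = history[-1], history[0]
        ds = -b * s_t + a * s_lag / (1.0 + s_lag ** 10)   # equation 7.2
        history.append(s_t + dt * ds)
        history.pop(0)
        series.append(history[-1])
    return series

series = mackey_glass(1000)
# Input window for predicting v = 50 steps ahead, as in equation 7.3:
# the four past samples s_n, s_{n-6}, s_{n-12}, s_{n-18}.
n = 500
x = [series[n], series[n - 6], series[n - 12], series[n - 18]]
```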
The error in the predicted value of s_{n+v} becomes available to the network only after v steps. The data are received sequentially and hence require the storage of all samples from s_{n-v-18} to s_n and from z_n to z_{n+v}. The usual performance measure given for prediction has been the normalized RMSE on the test data (Platt 1991a). For this problem, we have used the normalized RMSE as the approximation performance measure, where the normalization is with respect to the variance of the time-series. This measures the error in an underlying mapping that may exist between x and y, or the average prediction error if the network does not adapt to future data. If such a mapping does not exist, this measure does not represent the prediction performance, since the network continually adapts
to incoming data and has the capacity to modify its mapping appropriately. A suitable measure is the exponentially weighted prediction error (WPE), which at time (n) can be recursively computed as

WPE(n)^2 = λ WPE(n-1)^2 + (1 - λ) e_n^2    (7.7)
for some 0 < λ < 1. In the results shown, λ = 0.95. The parameter values used in this experiment are as follows: ε_max = 0.7, ε_min = 0.07, γ = 0.999, e_min = 0.05, κ = 0.87, η = 0.02, P_0 = R_n = 1.0, and Q_0 = 0.0002. The performance of the RAN and the enhanced network are shown in Figure 5. The growth patterns of the RAN and the RAN-EKF were similar to those observed in the first experiment. The growth rates were initially similar, and the RAN-EKF leveled off quickly in the later stages. The on-line prediction performance measured by the WPE shows that the RAN-EKF is consistently better than the RAN. On average, the WPE for the RAN-EKF was a factor of 2 smaller than that of the RAN. The final values of the WPE for the RAN and the RAN-EKF are 0.035 and 0.024, respectively. The normalized RMSE, which measures the average prediction performance on test data, also shows the superiority of the RAN-EKF, even when its complexity is smaller than that of the RAN. The final normalized RMSE of 0.056 for the RAN-EKF with 56 hidden units compares well with 0.063 for the RAN. However, such a number can be misleading, since it depends on choosing the appropriate time in the oscillatory performance measure shown in Figure 5. What the results demonstrate is that the enhanced network performed better in both the WPE and the RMSE than the RAN, with fewer hidden units. Platt had already shown that the RAN used fewer hidden units than fixed-size networks while achieving better performance (Platt 1991a). The enhancement suggested here has led us toward the minimal network that would be required in a sequential learning problem.

8 Conclusion
We have adopted the function estimation approach to optimal sequential learning with neural networks, a natural framework for approximation with ANNs. The suboptimal solution that resulted from this approach was shown to be equivalent to that of the RAN. Thus this work also provides an alternative interpretation of the RAN, in contrast to that of Platt, whose description was based on the RAN's similarity to the Parzen window and nearest-neighbor methods. We showed that the
Figure 5: Facing page. The performance of the RAN versions on the Mackey-Glass time-series prediction problem.
new basis functions (hidden units) are assigned only if they form a significant angle in the function space to the existing basis functions. This ensures the efficient use of the basis functions and their allocation only according to the complexity of the underlying function to be mapped. We have also proposed an enhancement to the RAN as a result of the function space interpretation given to its architecture and the growth criteria. This enhanced network uses the EKF algorithm to adapt the parameters when a hidden unit is not added. The superior performance of the enhanced network over the RAN was demonstrated in function approximation and time-series prediction tasks. The resulting network complexity is smaller while the approximation and prediction accuracy is higher. The initial convergence to the solution was also faster with the enhanced network. Results on a multiclass classification problem are given in Kadirkamanathan and Niranjan (1992), which further reinforce the above conclusion. This achievement is at the expense of added computational complexity, which may be compensated for either with systolic array or fast transversal filter implementations. The optimal sequential learning approach has taken us further toward attaining the minimal network that would be required for a given problem.
References

Azimi-Sadjadi, M. R., and Sheedvash, S. 1991. Recursive node creation in backpropagation neural networks using orthogonal projection method. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Toronto.
Candy, J. V. 1986. Signal Processing: The Model-Based Approach. McGraw-Hill, New York.
Duda, R. O., and Hart, P. E. 1973. Pattern Classification and Scene Analysis. John Wiley, New York.
Goodwin, G. C., and Sin, K. S. 1984. Adaptive Filtering Prediction and Control. Prentice-Hall, Englewood Cliffs, NJ.
Kadirkamanathan, V. 1991. Sequential Learning in Artificial Neural Networks. Ph.D. Thesis, Cambridge University Engineering Department.
Kadirkamanathan, V., and Fallside, F. 1990. F-Projection: A nonlinear recursive estimation algorithm for neural networks. Tech. Rep. CUED/F-INFENG/TR.53, Cambridge University Engineering Department.
Kadirkamanathan, V., and Niranjan, M. 1991. Nonlinear adaptive filtering in nonstationary environments. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Toronto.
Kadirkamanathan, V., and Niranjan, M. 1992. Application of an architecturally dynamic network for speech pattern classification. Proceedings of the Institute of Acoustics, Vol. 14.
Kadirkamanathan, V., Niranjan, M., and Fallside, F. 1991. Sequential adaptation of radial basis function neural networks and its application to time-
series prediction. In Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds. Morgan Kaufmann, San Mateo, CA.
Kadirkamanathan, V., Niranjan, M., and Fallside, F. 1992. Models of dynamic complexity for time-series prediction. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, San Francisco.
MacKay, D. J. C. 1992. Bayesian interpolation. Neural Comp. 4(3), 415-447.
Platt, J. C. 1991a. A resource allocating network for function interpolation. Neural Comp. 3(2), 213-225.
Platt, J. C. 1991b. Learning by combining memorization and gradient descent. In Neural Information Processing Systems 3, R. P. Lippmann, J. E. Moody, and D. S. Touretzky, eds. Morgan Kaufmann, San Mateo, CA.
Reilly, D. L., Cooper, L. N., and Elbaum, C. 1982. A neural model for category learning. Biol. Cybernet. 45, 35-41.
White, H. 1990. Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks 3, 535-544.
Young, P. C. 1984. Recursive Estimation and Time-Series Analysis. Springer-Verlag, Berlin.

Received 19 October 1992; accepted 10 February 1993.
Communicated by C. Lee Giles
Learning Finite State Machines With Self-Clustering Recurrent Networks

Zheng Zeng
Rodney M. Goodman
Department of Electrical Engineering, 116-81, California Institute of Technology, Pasadena, CA 91125 USA
Padhraic Smyth
Jet Propulsion Laboratory, 238-420, California Institute of Technology, Pasadena, CA 91109 USA
Recent work has shown that recurrent neural networks have the ability to learn finite state automata from examples. In particular, networks using second-order units have been successful at this task. In studying the performance and learning behavior of such networks we have found that the second-order network model attempts to form clusters in activation space as its internal representation of states. However, these learned states become unstable as longer and longer test input strings are presented to the network. In essence, the network "forgets" where the individual states are in activation space. In this paper we propose a new method to force such a network to learn stable states by introducing discretization into the network and using a pseudo-gradient learning rule to perform training. The essence of the learning rule is that in doing gradient descent, it makes use of the gradient of a sigmoid function as a heuristic hint in place of that of the hard-limiting function, while still using the discretized value in the feedback update path. The new structure uses isolated points in activation space instead of vague clusters as its internal representation of states. It is shown to have similar capabilities in learning finite state automata as the original network, but without the instability problem. The proposed pseudo-gradient learning rule may also be used as a basis for training other types of networks that have hard-limiting threshold activation functions.

1 Introduction

Theoretical aspects of grammatical inference have been studied extensively in the past (Angluin 1972, 1978; Gold 1972, 1978). A variety of direct search algorithms have been proposed for learning grammars from positive and negative examples (strings) (Angluin and Smith 1983; Fu

Neural Computation 5, 976-990 (1993)
© 1993 Massachusetts Institute of Technology
Learning Finite State Machines
1982; Muggleton 1990; Tomita 1982). More recently, recurrent neural networks have been investigated as an alternative method for learning simple grammars (Cleeremans et al. 1989; Elman 1990, 1991; Fahlman 1990; Giles et al. 1990, 1992; Jordan 1986; Pollack 1991; Rumelhart et al. 1986; Williams and Zipser 1989). A variety of network architectures and learning rules have been proposed. All have shown the capability of recurrent networks to learn different types of simple grammars from examples. In this paper we restrict the focus to studying a recurrent network's behavior in learning regular grammars, which are the simplest type of grammar in the Chomsky hierarchy, and have a one-to-one correspondence to finite state machines (Hopcroft and Ullman 1979). A regular language can be defined as the language accepted by its corresponding finite state acceptor (Σ, T, t_0, δ, F), where

Σ is the input alphabet.
T is a finite nonempty set of states.
t_0 is the start (or initial) state, an element of T.

δ is the state transition function; δ : T × Σ → T.

F is the set of final (or accepting) states, a (possibly empty) subset of T.

The purpose of the study is to obtain a better understanding of recurrent neural networks, their behavior in learning, and their internal representations, which in turn may give us more insight into their capability for fulfilling other more complicated tasks. Giles et al. (1990, 1992) have proposed a "second-order" recurrent network structure to learn regular languages. Henceforth, all references to second-order recurrent networks imply the network structure described in Giles et al. (1990, 1992). Our independent experiments have confirmed their results that second-order nets can learn various grammars well. In addition, we found that this structure learns these grammars more easily than the simple recurrent network structure (or the Elman structure) (Elman 1991), which does not use second-order units. However, a stability problem emerges with trained networks as longer and longer input strings are presented [similar behavior in recurrent networks has been found in different contexts (Pollack 1991; Servan-Schreiber et al. 1991)]. In our experiments, the problem appears in 14 out of 15 of the trained networks on nontrivial machines for long strings. A string can be misclassified as early as when its length is only 30% longer than the maximum training length in some of the experiments. The stability problem led us to look deeper into the internal representation of states in such a network, and the following interesting behavior was observed: during learning, the network attempts to form clusters in hidden unit space as its representation of states. This behavior occurred in all the learning experiments we performed. Once formed, the clusters are stable for short strings, i.e.,
Zheng Zeng, Rodney Goodman, and Padhraic Smyth
strings with lengths not much longer than the maximum length of training strings. However, in 14 out of 15 learned networks, when sufficiently long strings are presented for testing, the clusters (states) start to merge and ultimately become indistinguishable. (Details of these experiments will be explained in Section 2.) To solve this problem we propose a discretized combined network structure, as well as a pseudo-gradient learning method, which can be shown to successfully learn stable state representations. In the proposed network, instead of clusters, the states of the network are actually isolated points in hidden unit activation space.

2 The "Unstable State" Behavior of a Learned Second-Order Net
We found that the second-order network can be represented as two separate networks controlled by a gating switch (Fig. 1) as follows: the network consists of two first-order networks with shared hidden units. The common hidden unit values are copied back to both net0 and net1 after each time step, and the input stream acts like a switching control to enable or disable one of the two nets. For example, when the current input is 0, net0 is enabled while net1 is disabled. The hidden unit values are then determined by the hidden unit values from the previous time step, weighted by the weights in net0. The hidden unit activation function is the standard sigmoid function, f(x) = 1/(1 + e^{-x}). Note that this representation of a second-order network, as two networks with a gating function, provides insight into the nature of second-order nets, i.e., clearly they have greater representational power than a single simple recurrent network, given the same number of hidden units. This structure was used in our initial experiments. We use S_i^t to denote the activation value of hidden unit number i at time step t. w_{ij}^n is the weight from layer 1 node j to layer 2 node i in netn, with n = 0 or 1 in the case of binary inputs. Hidden node S_0 is chosen to be a special indicator node, whose activation should be close to 1 at the end of a legal string, or close to 0 otherwise. At time t = 0, initialize S_0^0 to be 1 and all other S_i^0 to be 0, i.e., assume that the null string is a legal string. The network weights are initialized randomly with a uniform distribution between -1 and 1. In the experiments described here we used the following grammars:

Tomita grammars (Tomita 1982):
o #1: 1*.
o #4: any string not containing "000" as a substring.
o #5: even number of 0s and even number of 1s.
o #7: 0*1*0*1*.
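The gated two-network view of the second-order net, together with string generation for one of the grammars just listed, can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the hidden-layer size, seeds, and helper names (step, run, tomita4_legal, make_training_set) are our own.

```python
import random
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

N = 4                                        # number of hidden units (illustrative)
rng = np.random.default_rng(0)
# One N x N weight matrix per input symbol: W[0] is "net0", W[1] is "net1".
W = rng.uniform(-1.0, 1.0, size=(2, N, N))

def step(S_prev, bit):
    """One time step: the input bit gates which first-order net is enabled."""
    return sigmoid(W[bit] @ S_prev)

def run(bits):
    """Feed a binary string; S[0] is the special indicator node."""
    S = np.zeros(N)
    S[0] = 1.0                               # null string is assumed legal
    for bit in bits:
        S = step(S, bit)
    return S

def tomita4_legal(s):
    """Tomita grammar #4: legal iff the string contains no "000" substring."""
    return "000" not in s

def make_training_set(n_strings=100, l_max=15, seed=1):
    """Random strings with lengths uniform on [1, Lmax], labeled by the grammar."""
    r = random.Random(seed)
    data = []
    for _ in range(n_strings):
        s = "".join(r.choice("01") for _ in range(r.randint(1, l_max)))
        data.append((s, tomita4_legal(s)))
    return data

final = run([0, 1, 1, 0])
decision = final[0] > 0.5                    # indicator-node classification
```

The untrained weights here produce arbitrary decisions; the point is the data flow, in which the input bit selects which of the two shared-hidden-unit nets is applied at each step.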
Simple vending machine (Carroll and Long 1989): The machine takes in three types of coins: nickel, dime, and quarter. Starting
Learning Finite State Machines
from empty, a string of coins is entered into the machine. The machine "accepts," i.e., a candy bar may be selected, only if the total amount of money entered exceeds 30 cents.

Figure 1: Equivalent first-order structure of the second-order network.

A training set consists of randomly chosen variable-length strings with lengths uniformly distributed between 1 and Lmax, where Lmax is the maximum training string length. Each string is marked as "legal" or "illegal" according to the underlying grammar. The learning procedure is a gradient descent method in weight space [similar to that proposed by Williams and Zipser (1989)] to minimize the error at the indicator node for each training string (Giles et al. 1992). In a manner different from that described in Giles et al. (1992), we present the whole training set (which consists of 100 to 300 strings, with Lmax in the range of 10 to 20) all at once to the network for learning,
Zheng Zeng, Rodney Goodman, and Padhraic Smyth
instead of presenting a portion of it in the beginning and gradually augmenting it as training proceeds. Also, we did not add any end symbol to the alphabet as in Giles et al. (1992).

Figure 2: Hidden unit activation plots S0-S3 in learning Tomita grammar #4. (S0 is the x axis.) (a)-(e) are plots of all activations on the training data set. (a) During 1st epoch of training. (b) During 16th epoch of training. (c) During 21st epoch of training. (d) During 31st epoch of training. (e) After 52 epochs, training succeeds, weights are fixed. (f) After training, when tested on a set of maximum length 50.

We found that the network can successfully learn the machines (2-10 states) we tested on, with a small number of hidden units (4-5) and fewer than 500 epochs, agreeing with the results described in Giles et al. (1992). To examine how the network forms its internal representation of states, we recorded the hidden unit activations at every time step of every training string in different training epochs. As a typical example, shown in Figure 2a-e are the S0-S3 activation-space records of the learning process of a 4-hidden-unit network. The underlying grammar
was Tomita #4, and the training set consisted of 100 random strings with Lmax = 15. Note that the dimension S0 is chosen here because it is the important "indicator node," while S3 is chosen arbitrarily. The observations that follow can be made from any of the 2-D plots from any run in learning any of the grammars in the experiments. Each point corresponds to the activation pattern at a certain time step in a certain string. Each plot contains the activation points of all time steps for all training strings in a certain training epoch, as described in the caption. The following behavior can be observed:

1. As learning takes place, the activation points seem to be pulled in several different directions, and distinct clusters gradually appear (Fig. 2a-e).
2. After learning is complete, i.e., when the error on each of the training strings is below a certain tolerance level, the activation points form distinct clusters, which consist of segments of curves (Fig. 2e).

3. Note in particular that there exists a clear gap between the clusters in the S0 (indicator) dimension, which means that the network is making unambiguous decisions for all the training strings and each of their prefix strings (Fig. 2e).

4. When given a string, the activation point of the network jumps from cluster to cluster as input bits are read in one by one. Hence, the behavior of the network looks just like a state machine's behavior.

It is clear that the network attempts to form clusters in activation space as its own representation of states and is successful in doing so. Motivated by these observations, we applied the k-means clustering algorithm to the activation record in activation space of the trained network to extract the states [instead of simply dividing up the space evenly as in Giles et al. (1992)]. In choosing the parameter k, we found that if k was chosen too small, the extracted machine sometimes could not classify all the training strings correctly, while a large k always guaranteed perfect performance on training data. Hence, k was chosen to be a large number, for example, 20. The initial seeds were chosen randomly. We then defined each cluster found by the k-means algorithm to be a "state" of the network and used the center of each cluster as a representative of the state. The transition rules for the resulting state machine are calculated by setting the nodes equal to a cluster center, then applying an input bit (0 or 1 in the binary alphabet case), and calculating the values of the S_i^t nodes. The transition from the current state given the input bit is then to the state whose center is closest in Euclidean distance to the obtained S_i^t values. In all our experiments, the resulting machines were several states larger than the correct underlying minimal machines. Moore's state machine reduction algorithm was then applied to the originally extracted machine
to get an equivalent minimal machine, which accepts the same language but with the fewest possible number of states. Similar to the results in Giles et al. (1992), we were able to extract machines that are equivalent to the minimal machines corresponding to the underlying grammars from which the data were generated. These trained networks perform well in classifying unseen short strings (not much longer than Lmax). However, as longer and longer strings are presented to the network, the percentage of strings correctly classified drops substantially. Shown in Figure 2f are the recorded activation points for S0-S3 of the same trained network from Figure 2e when long strings are presented. The original net was trained on 100 strings with Lmax = 15, whereas the maximum length of the test strings in Figure 2f was 50. Activation points at all time steps for all test strings are shown. Several observations can be made from Figure 2f:

1. The well-separated clusters formed during training begin to merge
together for longer and longer strings and eventually become indistinguishable. The points in the center of Figure 2f correspond to activations at time steps beyond Lmax = 15.

2. The gap in the S0 dimension disappears, which means that the network cannot make hard decisions on long strings.
3. The activation points of a string stay in the original clusters for short strings and start to diverge from them as strings become longer and longer. The diverging trajectories of the points form curves with sigmoidal shape.

Similar behavior was observed for 14 out of the 15 networks successfully trained on different machines, excluding the vending machine model. Some of the networks started to misclassify as early as when the input strings were only 30% longer than Lmax. Each of these 14 trained networks made classification errors on randomly generated test sets with maximum string length no longer than 5Lmax. The remaining network was able to maintain a stable representation of states for very long strings (up to length 1000). Note that the vending machine was excluded because it is a trivial case for long strings, i.e., all long strings are legal strings, so there is no need to distinguish between them. This is not the case for the other machines.

3 A Network That Can Form Stable States
From the above experiments it is clear that even though the network is successful in forming clusters as its state representation during training, it often has difficulty in creating stable clusters, i.e., in forming clusters such that the activation points for long strings converge to the centers of each cluster instead of diverging as observed in our experiments. The problem can be considered inherent to the structure of the network, which uses analog values to represent states, while the states in the underlying state machine are actually discrete. One intuitive suggestion to fix the problem is to replace the analog sigmoid activation function in the hidden units with a threshold function:

D(x) = 1.0 if x >= 0.5, 0.0 if x < 0.5.

In this manner, once the network is trained, its representation of states (i.e., the activation pattern of the hidden units) will be stable, and the activation points will not diverge from these state representations once they are formed. However, there is no known method to train such a network, since one cannot take the gradient of such activation functions. An alternative approach would be to train the original second-order network as described earlier, but to add the discretization function D(x) on the copy-back links during testing. The problem with this method is that one does not know a priori where the clusters formed during training will be. Hence, one does not have good discretization values to threshold the analog values so that the discretized activations are reset to a cluster center. Experimental results have confirmed this prediction. For example, after adding the discretization, the modified network cannot even correctly classify the training set that it has successfully learned in training. As in the previous example, after training and without the discretization, the network's classification rate on the training set was 100%, while with the discretization added, the rate became 85%. For test sets of longer strings, the rates with discretization were even worse. We propose instead that the discretization be included in both training and testing, in the following manner. Figure 3 shows the structure of the network with discretization added.
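The training obstacle noted above — a hard threshold has zero gradient almost everywhere — can be seen numerically. The following is a small sketch with our own function names, not part of the original experiments:

```python
import numpy as np

def D(x):
    """Hard threshold suggested above: stable states, but untrainable."""
    return np.where(x >= 0.5, 1.0, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A finite-difference "gradient" of D is zero away from the threshold,
# so gradient descent receives no signal through a thresholded unit ...
eps = 1e-6
x0 = 0.7
dD = (D(x0 + eps) - D(x0 - eps)) / (2 * eps)
# ... while the sigmoid still has a usable, nonzero slope at the same point.
ds = (sigmoid(x0 + eps) - sigmoid(x0 - eps)) / (2 * eps)
assert dD == 0.0 and ds > 0.0
```

This is exactly the gap the pseudo-gradient method of Section 4 exploits: the sigmoid's slope is used as a stand-in for the threshold's nonexistent gradient.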
From the formulas below, one can clearly see that in operational mode, that is, when testing, the network is equivalent to a network with discretization only:

h_i^t = f( sum_j w_ij^{x_t} S_j^{t-1} ),  for all i, t,

S_i^t = D(h_i^t) = D0( sum_j w_ij^{x_t} S_j^{t-1} ),

where

D(x) = 0.8 if x >= 0.5, 0.2 if x < 0.5,
D0(x) = 0.8 if x >= 0, 0.2 if x < 0.

(Here x_t is the input bit at time step t. We use h_i^t to denote the analog value of hidden unit i at time step t, and S_i^t the discretized value of hidden unit i at time step t.)
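The equivalence claimed for the operational mode follows because the sigmoid f is monotone and crosses 0.5 exactly at 0, so D(f(x)) = D0(x) for every net input x. A sketch checking this (variable names are ours; the 0.8/0.2 levels follow the definitions in the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def D(x):
    """Discretization applied to the hidden activations (after the sigmoid)."""
    return np.where(x >= 0.5, 0.8, 0.2)

def D0(x):
    """Equivalent test-time discretization applied directly to the net input."""
    return np.where(x >= 0.0, 0.8, 0.2)

# D(f(x)) == D0(x) everywhere, so the sigmoid nodes can be dropped in testing.
net_inputs = np.linspace(-3.0, 3.0, 101)
assert np.allclose(D(sigmoid(net_inputs)), D0(net_inputs))

# One operational-mode time step, S_i^t = D0(sum_j w_ij^{x_t} S_j^{t-1}),
# for a hypothetical 4-unit net:
N = 4
rng = np.random.default_rng(2)
W = rng.uniform(-1.0, 1.0, size=(2, N, N))
S = np.full(N, 0.2)
S[0] = 0.8
S_next = D0(W[1] @ S)
assert set(np.unique(S_next)) <= {0.2, 0.8}
```

Note that after one step every hidden value sits at 0.2 or 0.8, which is what makes the network states isolated points rather than clusters.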
Figure 3: A combined network with discretizations.

Hence, the sigmoid nodes can be eliminated in testing to simplify computation. During training, however, the gradient of the soft sigmoid function is made use of in a pseudo-gradient method for updating the weights. The next section explains the method in more detail. By adding these discretizations into the network, one might argue that the capacity of the net is greatly reduced, since each node can now take on only two distinct values, as opposed to infinitely many values (at least in theory) in the case of the undiscretized networks. However, in the case of learning discrete state machines, the argument depends on the definition of the capacity of the analog network. In our experiments, 14 of the 15 learned networks had unstable behavior for nontrivial long strings, so one can say that the capability of such networks to distinguish different states may start high, but deteriorates over time and eventually becomes zero.
4 The Pseudo-Gradient Learning Method
During training, at the end of each string x_1, ..., x_L, the mean squared error is calculated as follows (note that L is the string length and h_0^L is the analog indicator value at the end of the string):

E = (1/2) (h_0^L - T)^2,  where T = target = 1 if "legal", 0 if "illegal".

Update w_ij^n, the weight from node j to node i in netn, at the end of each string presentation:

delta w_ij^n = -alpha dE/dw_ij^n = -alpha (h_0^L - T) d~h_0^L/dw_ij^n,

where d~h_0^L/dw_ij^n is what we call the "pseudo-gradient" with respect to w_ij^n. To get the pseudo-gradient d~h_0^L/dw_ij^n, the pseudo-gradients d~h_k^t/dw_ij^n need to be calculated forward in time at each time step:

d~h_k^t/dw_ij^n = f'( sum_l w_kl^{x_t} S_l^{t-1} ) [ delta(n, x_t) delta(k, i) S_j^{t-1} + sum_l w_kl^{x_t} d~h_l^{t-1}/dw_ij^n ],  for all i, j, n, k.

(Initially, set d~h_k^0/dw_ij^n = 0 for all i, j, n, k.) As can be seen clearly, in carrying out the chain rule for the gradient we replace the real gradient dS_l^{t-1}/dw_ij^n, which is zero almost everywhere, by the pseudo-gradient d~h_l^{t-1}/dw_ij^n. The justification for the use of the pseudo-gradient is as follows: suppose we are standing on one side of a hard threshold function S(x) (a step at x = 0), at a point x0 > 0, and we wish to go downhill. The real gradient of S(x) would not give us any information, since it is zero at x0. If instead we look at the gradient of the function f(x), which is positive at x0 and increases as x0 approaches 0, it tells us that the downhill direction is to decrease x0, which is also the case for S(x). In addition, the magnitude of the gradient tells us how close we are to a step down in S(x). Therefore, we can use that gradient as a heuristic hint as to which direction (and how close) a step down would be. This heuristic hint is what we used as the pseudo-gradient in our gradient update calculation.
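One full pseudo-gradient update for a single labeled string can be sketched as follows. This is our own vectorized transcription of the forward recursion, with an illustrative learning rate and array layout, not the authors' implementation; it assumes the D, f, and weight conventions defined above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def D(x):
    """Training-time discretization of the hidden activations."""
    return np.where(x >= 0.5, 0.8, 0.2)

def train_string(W, bits, target, lr=0.5):
    """One pseudo-gradient update for a single nonempty labeled string.
    W has shape (2, N, N); P[k, n, i, j] carries the pseudo-gradient of
    h_k with respect to w_ij^n forward in time."""
    N = W.shape[1]
    S = np.zeros(N)
    S[0] = 1.0                            # null string assumed legal
    P = np.zeros((N, 2, N, N))            # pseudo-gradients, initialized to 0
    for bit in bits:
        h = sigmoid(W[bit] @ S)
        fprime = h * (1.0 - h)            # sigmoid slope at the net input
        P_new = np.zeros_like(P)
        for k in range(N):
            # direct term: d(net_k)/d(w_{bit,k,j}) = S_j^{t-1}
            P_new[k, bit, k, :] += S
            # recurrent term, with dS/dw replaced by the pseudo-gradient dh/dw
            P_new[k] += np.tensordot(W[bit][k], P, axes=(0, 0))
            P_new[k] *= fprime[k]
        S, P = D(h), P_new
    err = h[0] - target                   # h[0]: analog indicator at string end
    W -= lr * err * P[0]                  # descent on E = (h0 - T)^2 / 2
    return 0.5 * err ** 2
```

Note how the discretized values S, not the analog values h, are what the recursion feeds forward as activations, while the sigmoid slope supplies the heuristic gradient signal.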
Figure 4: Discretized network learning Tomita grammar #4. (a) h0-h3 during 1st epoch of training. (b) h0-h3 during 15th epoch of training. (c) h0-h3 after 27 epochs, when training succeeds and the weights are fixed. (d) S0-S3, the discretized copy of h0-h3 in (c).
5 Experimental Results

Shown in Figure 4a-c are the h0-h3 activation-space records of the learning process of a discretized network (the h values are the undiscretized values
from the sigmoids). The underlying grammar is again Tomita grammar #4. The parameters of the network and the training set are the same as in the previous case. Again, any of the other 2-D plots from any run in learning any of the grammars in the experiments could have been used here. Figure 4c is the final result after learning, when the weights are fixed. Notice that there are only a finite number of points in the final plot in the analog activation h-space, due to the discretization. Figure 4d shows the discretized value plot in S0-S3, where only three points can be seen. Each point in the discretized activation S-space is automatically defined as a distinct state; no point is shared by any of the states. The transition rules are calculated as before, and an internal state machine in the network is thus constructed. In this manner, the network performs self-clustering. For this example, six points are found in S-space, so a six-state machine is constructed, as shown in Figure 5a. Not surprisingly, this machine reduces by Moore's algorithm to a minimal machine with four states, which is exactly Tomita grammar #4 (Fig. 5b). Similar results were observed for all the other grammars in the experiments. There are several advantages to introducing discretization into the network:

1. Once the network has successfully learned the state machine from the training set, its internal states are stable. The network will always classify input strings correctly, independent of the lengths of these strings.
2. No clustering is needed to extract the state machine: instead of using vague clusters as its states, the network has formed distinct, isolated points as states. Each point in activation space is a distinct state. The network behaves exactly like a state machine.

3. Experimental results show that the size of the state machines extracted in this approach, which need not be decided manually (there is no need to choose k for k-means) as in the previous undiscretized case, is much smaller than found previously by the clustering method.
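The clustering-free extraction in point 2 can be sketched as follows: enumerate the isolated points the operational-mode network can reach, treating each distinct point as a state, and compute transitions by applying each input bit from that point. This is a reconstruction with our own names and a hypothetical untrained net; the 0.8/0.2 levels follow Section 3.

```python
import numpy as np

def D0(x):
    """Operational-mode discretization (levels as in Section 3)."""
    return np.where(x >= 0.0, 0.8, 0.2)

def extract_machine(step_fn, S_start, alphabet=(0, 1), max_states=64):
    """Breadth-first enumeration of the isolated points reachable in
    S-space; each distinct point is a state, and delta[(state, bit)]
    gives the successor state under that input bit."""
    states = [tuple(S_start)]
    frontier = [tuple(S_start)]
    delta = {}
    while frontier and len(states) <= max_states:
        s = frontier.pop()
        for bit in alphabet:
            nxt = tuple(step_fn(np.array(s), bit))
            if nxt not in states:
                states.append(nxt)
                frontier.append(nxt)
            delta[(states.index(s), bit)] = states.index(nxt)
    return states, delta

# Hypothetical 3-unit net in operational mode (random, untrained weights):
rng = np.random.default_rng(3)
W = rng.uniform(-1.0, 1.0, size=(2, 3, 3))
S_start = np.array([1.0, 0.0, 0.0])      # null string assumed legal
states, delta = extract_machine(lambda S, b: D0(W[b] @ S), S_start)
accept = [s[0] >= 0.5 for s in states]   # indicator unit high => "accept"
```

With N discretized units there are at most 2^N reachable points (plus the initial state), so the enumeration always terminates quickly for the small nets used here.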
It should be noted that convergence has a different meaning in the case of training discrete networks as opposed to the case of training analog networks. In the analog networks’ case, learning is considered to have converged when the error for each sample is below a certain error tolerance level. In the case of discrete networks, however, learning is stopped and considered to have converged only when zero error is obtained on all samples in the training set. In the experiments reported in this paper the analog tolerance level was set to 0.2. The discretized networks took on average 30% longer to train in terms of learning epochs compared to the analog networks for this specific error tolerance level.
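Moore's reduction step, used above to collapse the six-state machine of Figure 5a to the four-state minimal machine of Figure 5b, can be sketched as iterative partition refinement. This is a generic textbook version, not the authors' implementation; the example machine is our own hand-built transition table for Tomita #4.

```python
def minimize(delta, accept):
    """Moore-style state reduction by partition refinement.
    delta[s][b] is the successor of state s on input bit b; accept[s] is
    True for accepting states. Returns (number of classes, state -> class)."""
    n = len(delta)
    # Initial partition: accepting versus rejecting states.
    block = [1 if accept[s] else 0 for s in range(n)]
    while True:
        # A state's signature: its own block plus its successors' blocks.
        sigs = [(block[s], tuple(block[delta[s][b]] for b in range(2)))
                for s in range(n)]
        relabel = {}
        new_block = []
        for sig in sigs:
            relabel.setdefault(sig, len(relabel))
            new_block.append(relabel[sig])
        if new_block == block:
            return len(relabel), block
        block = new_block

# Tomita #4 (no "000"): states 0-2 count trailing zeros, state 3 is the
# dead state, and state 4 is a redundant copy of state 0 that reduction merges.
delta = [[1, 0], [2, 0], [3, 0], [3, 3], [1, 0]]
accept = [True, True, True, False, True]
n_min, classes = minimize(delta, accept)
assert n_min == 4 and classes[4] == classes[0]
```

Each pass splits blocks whose members disagree on acceptance or on successor blocks; the partition can only refine, so the loop terminates at a fixed point.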
Figure 5: Extracted state machine from the discretized network after learning Tomita grammar #4 (double circle means "accept" state, single circle means "reject" state). (a) Six-state machine extracted directly from the discrete activation space. (b) Equivalent minimal machine of (a).

6 Conclusion
In this paper we explored the formation of clusters in hidden unit activation space as an internal state representation for second-order recurrent networks that learn regular grammars. The states formed by such a network during learning are not a stable representation: when long strings are seen by the network, the states merge into each other and eventually become indistinguishable. We suggested introducing hard-limiting threshold discretization into the network and presented a pseudo-gradient learning method to train such a network. The method is heuristically plausible, and experimental results show that the network has similar capabilities in learning finite
state machines as the original second-order network, but is stable regardless of string length, since the internal representation of states in this network consists of isolated points in activation space. The proposed pseudo-gradient learning method suggests a general approach for training networks with threshold activation functions.
Acknowledgments

The research described in this paper was supported in part by ONR and ARPA under Grants AFOSR-90-0199 and N00014-92-J-1860. In addition, this work was carried out in part by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.
References

Angluin, D. 1982. Inference of reversible languages. J. Assoc. Comput. Mach. 29(3), 741-765.
Angluin, D. 1978. On the complexity of minimum inference of regular sets. Inform. Control 39, 337-350.
Angluin, D., and Smith, C. H. 1983. Inductive inference: Theory and methods. ACM Computing Surveys 15(3), 237.
Carroll, J., and Long, D. 1989. Theory of Finite Automata. Prentice Hall, Englewood Cliffs, NJ.
Cleeremans, A., Servan-Schreiber, D., and McClelland, J. L. 1989. Finite state automata and simple recurrent networks. Neural Comp. 1, 372-381.
Elman, J. L. 1990. Finding structure in time. Cog. Sci. 14, 179-211.
Elman, J. L. 1991. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learn. 7(2/3), 195-225.
Fahlman, S. E. 1990. The recurrent cascade-correlation architecture. In Advances in Neural Information Processing Systems, pp. 190-196.
Fu, K. S. 1982. Syntactic Pattern Recognition and Applications. Prentice Hall, Englewood Cliffs, NJ.
Giles, C. L., Sun, G. Z., Chen, H. H., Lee, Y. C., and Chen, D. 1990. Higher order recurrent networks and grammatical inference. In Advances in Neural Information Processing Systems, pp. 380-387.
Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., and Lee, Y. C. 1992. Second-order recurrent neural networks. Neural Comp. 4(3), 393-405.
Gold, E. M. 1972. System identification via state characterization. Automatica 8, 621-636.
Gold, E. M. 1978. Complexity of automaton identification from given data. Inform. Control 37, 302-320.
Hopcroft, J. E., and Ullman, J. D. 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA.
Jordan, M. I. 1986. Serial order: A parallel distributed processing approach. Tech. Rep. No. 8604, Institute for Cognitive Science, University of California, San Diego.
Kudo, M., and Shimbo, M. 1988. Efficient regular grammatical inference techniques by the use of partial similarities and their logical relationships. Pattern Recog. 21(4), 401-409.
Muggleton, S. 1990. Grammatical Induction Theory. Addison-Wesley, Turing Institute Press, Reading, MA.
Pollack, J. B. 1991. The induction of dynamical recognizers. Machine Learn. 7(2/3), 227-252.
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. 1986. Parallel Distributed Processing, pp. 354-361. The MIT Press.
Servan-Schreiber, D., Cleeremans, A., and McClelland, J. L. 1991. Graded state machines: The representation of temporal contingencies in simple recurrent networks. Machine Learn. 7(2/3), 161-193.
Tomita, M. 1982. Dynamic construction of finite-state automata from examples using hill climbing. In Proceedings of the Fourth Annual Cognitive Science Conference, p. 105.
Williams, R. J., and Zipser, D. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Comp. 1(2), 270-280.

Received 15 June 1992; accepted 8 March 1993.
Index Volume 5 By Author Abbott, L. E and LeMasson, G. Analysis of Neuron Models with Dynamically Regulated Conductances (Article)
5(6):823-842
Abu-Mostafa, Y.S. Hints and the VC Dimension (Letter)
5(2):278-288
Amari, S. and Murata, N. Statistical Theory of Learning Curves under Entropic Loss Criterion (Letter)
5(1):140-153
Amit, D. J.
- See Griniasty, M.
Atick, J. J. and Redlich, A. N. Convergent Algorithm for Sensory Receptive Field Development (Letter)
5(1h45-60
Ayala, G. F. - Migliore, M. Atlan, H. - Rosenberg, C. Baddeley, R. J. - See Cairns, D. E. Baldi, P. and Chauvin, Y. Neural Networks for Fingerprint Recognition (Letter)
5(3):402418
Back, A. D. and Tsoi, A. C. A Simplified Gradient Algorithm for IIR Synapse Multilayer Perceptrons (Letter)
5(3):45&462
Bartlett, I? L. Vapnik-Chervonenkis Dimension Bounds for Two- and Three-Layer Networks (Note)
5(3):371-373
Becker, S. and Hinton, G. E. Learning Mixture Models of Spatial Coherence (Letter)
5(2):267-277
Bialek, W. - See Kruglyak, L. Blakemore, C. - See Nicoll, A. Blazis, D. E. J., Fischer, T. M. and Carew, T. J. A Neural Network Model of Inhibitory Information Processing in Aplysia (Letter)
5(2):213-227
992
Borst, A., Engelhaaf, M., and Seung, H. S. Two-Dimensional Motion Perception in Flies (Letter)
Index
5(6):856-868
Bottou, L. - Vapnik, V. Bower, J. M. - See De Schutter, E. Bowtell, G. - See Ferrar, C. H. Bromley, J. and Denker, J. S. Improving Rejection Performance on Handwritten Digits by Training with "Rubbish (Note) Buhmann, J. and Kuhnel, H. Complexity Optimized Data Clustering by Competitive Neural Networks (Letter) Cairns, D. E., Baddeley, R. J., and Smith, L. S. Constraints on Synchronizing Oscillator Networks (Letter)
5(3):367-370
5(3):75-88
5(2):260-266
Carew, T. J. - See Blazis, D. E. J. Changeux, J. P. - Kerzberg, M. Chauvin, Y. - See Baldi, l? Chiel, H. J. - See Srinivasan, R. Chen, A. M., Lu, H., and Hecht-Nielsen, R. On the Geometry of Feedforward Neural Network Error Surfaces (Letter)
5(6):91&927
Cho, S. and Reggia, J. A. Learning Competition and Cooperation (Letter)
5(2)~242-259
Choi, H. - See Phatak, D. S. Conwell, P. R. - See Cotter, N. E. Cotter, N. E. and Conwell, P. R. Universal Approximation by Phase Series and Fixed-Weight Networks (Note)
5(3)~359-362
Dayan, P. Arbitrary Elastic Topologies and Ocular Dominance (Letter)
5(3):392-401
Dayan, P. Improving Generalization for Temporal Difference Learning: The Successor Representation (Letter)
5(4):613-624
Index
993
Dayan, P. and Sejnowski, T. J. The Variance of Covariance Rules for Associative Matrix Memories and Reinforcement Learning (Note)
5(2):205-209
Deco, G. and Ebmeyer, J. Coarse Coding Resource-Allocating Network (Letter)
5(1):105-114
DeFelice, L. J. - See Strassberg, A. F. Denby, B. The Use of Neural Networks in High-Energy Physics (Review)
5(4):505-549
Denker, J. S. - See Bromley, J. De Schutter, E. and Bower, J. M. Sensitivity of Synaptic Plasticity to the CA2+ Permeability of NMDA Channels: A Model of Long-Term Potentiation in Hippocampal Neurons (Letter)
5(5):681-694
Dorronsoro, J. R. - Lopez, V. Dreyfus, G. - Linster, C.
Dreyfus, G. - See Nerrand, 0. Ebmeyer, J. - See Deco, G. Edelman, S. - See Weiss, Y. Elias, J. G. Artificial Dendritic Trees (Letter)
5(4):64&664
Engelhaaf, M. - See Borst, A. Erel, J. - See Rosenberg, C. Fahle, M. - See Weiss, Y. Ferrar, C. H., Williams, T. L., and Bowtell, G. The Effects of Cell Duplication and Noise in a Pattern Generating Network (Letter)
5(4):587-596
Fischer, T. M. - See Blazis, D. E. J. Floreen, P. and Orponen, P. Attraction Radii in Binary Hopfield Nets are Hard to Compute (Letter)
5(5):812-821
Index
994
Gelenbe, E. Learning in the Recurrent Random Neural Network (Letter)
5(1)A54164
Gold, J. I. - See Intrator, N. Golea, M. and Marchand, M. On Learning Perceptrons with Binary Weights (Letter)
5(5):767-782
Goodman, R. M. - See Zeng, Z. Govil, S. - See Mukhopadhyay, S. Grannan, E. R., Kleinfeld, D. and Sompolinsky, H. Stimulus-Dependent Synchronization of Neuronal Assemblies (Article)
5(4):55&569
Griniasty, M., Tsodyks, M. V., and Amit, D. J. Conversion of Temporal Correlations Between Stimuli to Spatial Correlations Between Attractors (Article)
5(1)A-17
Hall, T. J. - Kendall, G. D. Hasselmo, M. E. Acetylcholine and Learning in a Cortical Associative Memory (Letter)
5(1):32-44
Haykin, S. - See Leung, H. Hecht-Nielsen, R.
- See Chen, A. M.
Herrmann, M. - See Horn, D. Hinton, G. E. - See Becker, S. Horn, D., Ruppin, E., Usher, M. and Herrmann, M. Neural Network Modeling of Memory Deterioration (Letter)
5(5):736749
Huerta, R. - See Lopez, V. Intrator, N. Combining Exploratory Projection Pursuit and Projection Pursuit Regression with Application to Neural Networks (Letter) Intrator, N. and Gold J. I. Three-Dimensional Object Recognition Using an Unsupervised BCM Network The Usefulness of Distinguishing Features (Letter)
5(3):443-457
5(1):61-74
Index
995
Kadirkamanathan, V. and Niranjan, M. A Function Estimation Approach to Sequential Learning with Neural Networks (Letter)
5(6):954-975
Kendall, G. D. and Hall, T. J. Optimal Network Construction by Minimum Description Length (Note)
5(2):210-212
Kerlirzin, P. and Vallet, F. Robustness in Multilayer Perceptrons (Letter)
5(3):473482
Kerszberg, M. and Changeux, J. P. A Model for Motor Endplate Morphogenesis: Diffusible Morphogens, Transmembrane Signaling, and Compartmentalized Gene Expression (Article)
5(3):341-358
Kersberg, M. - See Linster, C. Kidder, J. N. and Seligson, D. Fast Recognition of Noisy Digits (Letter)
5(6):885-892
Kim, L. S. - See Mukhopadhyay, S. Kleinfeld, D. - See Grannan, E. R. Konen, W.and Von der Malsberg, C. Learning to Generalize from Single Examples in the Dynamic Link Architecture (Letter)
5(5):719-735
Koren, I. - See Phatak, D. S. Kottas, J. A. Training Periodic Sequences Using Fourier Series Error Criterion (Letter)
5(1):115-131
Kruglyak, L. and Bialek, W. Statistical Mechanics for a Network of Spiking Neurons (Letter) 5(1):21-31 Kuhnel, H. - See Buhmann, J. Lappe, M. and Rauschecker, J. P. A Neural Network for the Processing of Optic Flow from Ego-Motion in Man and Higher Mammals 5(3):374391 (Letter) LeMasson, G. - See Abbott, L. F. Leung, H. and Haykin, S. Rational Function Neural Network (Letter)
5(6):928-938
Lin, J. and Unbehauen, R. On the Realization of a Kolmogorov Network (Note)
5(1):18-20
Linster, C., Masson, C., Kerszberg, M., Personnaz, L., and Dreyfus, G. Computational Diversity in a Formal Model of the Insect Olfactory Macroglomerulus (Letter) Lopez, V., Huerta, R., and Dorronsoro, J. R. Recurrent and Feedforward Polynomial Modeling of Coupled Time Series (Letter) Lu, H. - See Chen, A. M. Marchand, M. - See Golea, M. Marcos, S. - See Nerrand, O. Martin, G. L. Centered-Object Integrated Segmentation and Recognition of Overlapping Handprinted Characters (Letter) Martinez, D. - See Van Hulle, M. M. Migliore, M. and Ayala, G. F. A Kinetic Model of Short- and Long-Term Potentiation (Letter)
5(4):636-647
Mukhopadhyay, S., Roy, A., Kim, L. S., and Govil, S. A Polynomial Time Algorithm for Generating Neural Networks for Pattern Classification: Its Stability Properties and Some Test Results (Letter)
5(2):317-330
Murata, N. - See Amari, S. Nerrand, O., Roussel-Ragot, P., Personnaz, L., Dreyfus, G., and Marcos, S. Neural Networks and Nonlinear Adaptive Filtering: Unifying Concepts and New Algorithms (Review)
5(2):165-199
Nicoll, A. and Blakemore, C. Patterns of Local Connectivity in the Neocortex (Letter) Niebur, E. - See Usher, M. Niranjan, M. - See Kadirkamanathan, V.
5(5):665-680
Ohlsson, M., Peterson, C., and Soderberg, B. Neural Networks for Optimization Problems with Inequality Constraints: The Knapsack Problem (Letter)
5(2):331-339
Orponen, P. - See Floreen, P. Park, J. and Sandberg, I. W. Approximation and Radial-Basis-Function Networks (Letter)
5(3):305-316
Pentland, A. P. Surface Interpolation Networks (Letter)
5(3):430-442
Personnaz, L. - See Linster, C. Peterson, C. - See Ohlsson, M. Personnaz, L. - See Nerrand, O. Phatak, D. S., Choi, H., and Koren, I. Construction of Minimal n-2-n Encoders for Any n (Letter)
5(5):783-794
Prelinger, D. - See Schmidhuber, J. Rauschecker, J. P. - See Lappe, M. Redish, A. D. - See Touretzky, D. S. Redlich, A. N. Redundancy Reduction as a Strategy for Unsupervised Learning (Letter)
5(2):289-304
Redlich, A. N. Supervised Factorial Learning (Letter)
5(5):750-766
Redlich, A. N. - See also Atick, J. J. Reggia, J. A. - See Cho, S. Rognvaldsson, T. Pattern Discrimination Using Feedforward Networks: A Benchmark Study of Scaling Behavior (Letter)
5(3):483-491
Rosenberg, C., Erel, J., and Atlan, H. A Neural Network That Learns to Interpret Myocardial Planar Thallium Scintigrams (Letter)
5(3):492-502
Roussel-Ragot, P. - See Nerrand, O. Roy, A. - See Mukhopadhyay, S.
Ruppin, E. - See Horn, D. Sandberg, I. W. - See Park, J. Schmidhuber, J. and Prelinger, D. Discovering Predictable Classifications (Letter)
5(4):625-635
Schuster, H. G. - See Usher, M. Sejnowski, T. J. - See Dayan, P. Seligson, D. - See Kidder, J. N. Sereno, M. E. - See Zhang, K. Sereno, M. I. - See Zhang, K. Seung, H. S. - See Borst, A. Smith, L. E. - See Cairns, D. E. Smyth, P. - See Zeng, Z. Soderberg, B. - See Ohlsson, M. Sompolinsky, H. - See Grannan, E. R. Srinivasan, R. and Chiel, H. J. Fast Calculation of Synaptic Conductances (Note)
5(2):200-204
Strassberg, A. F. and DeFelice, L. J. Limitations of the Hodgkin-Huxley Formalism: Effects of Single Channel Kinetics on Transmembrane Voltage Dynamics (Letter)
5(6):843-855
Takahashi, Y. Generalization and Approximation Capabilities of Multilayer Networks (Letter)
5(1):132-139
Tanaka, T. and Yamada, M. The Characteristics of the Convergence Time of Associative Neural Networks (Letter)
5(3):463-472
Touretzky, D. S., Redish, A. D., and Wan, H. S. Neural Representation of Space Using Sinusoidal Arrays (Letter)
5(6):869-884
Tsodyks, M. V. - See Griniasty, M. Tsoi, A. C. - See Back, A. D. Unbehauen, R. - See Lin, J.
Usher, M., Schuster, H. G., and Niebur, E. Dynamics of Populations of Integrate-and-Fire Neurons, Partial Synchronization and Memory (Letter)
5(4):570-586
Usher, M. - See Horn, D. Vallet, F. - See Kerlirzin, P. Van Hulle, M. M. and Martinez, D. On an Unsupervised Learning Rule for Scalar Quantization Following the Maximum Entropy Principle (Letter)
5(6):939-953
Vapnik, V. and Bottou, L. Local Algorithms for Pattern Recognition and Dependencies Estimation
5(6):893-909
Von der Malsburg, C. - See Konen, W. Wan, H. S. - See Touretzky, D. S. Weiss, Y., Edelman, S., and Fahle, M. Models of Perceptual Learning in Vernier Hyperacuity (Letter)
5(5):695-718
Williams, T. L. - See Ferrar, C. H. Wong, Y. Clustering Data by Melting (Letter)
5(1):89-104
Yamada, M. - See Tanaka, T. Yang, L. and Yu, W. Backpropagation with Homotopy (Note)
5(3):363-366
Yu, W. - See Yang, L. Zeng, Z., Goodman, R., and Smyth, P. Learning Finite State Machines With Self-Clustering Recurrent Networks (Letter)
5(6):976-990
Zhang, K., Sereno, M. I., and Sereno, M. E. Emergence of Position-Independent Detectors of Sense of Rotation and Dilation with Hebbian Learning: An Analysis (Letter)
5(4):597-612